Enhancing the correlation between the quality and intelligibility objective metrics with the subjective scores by shallow feed forward neural network for time–frequency masking speech separation algorithms

Sania Gul, Muhammad Salman Khan, Syed Waqar Shah, Sheheryar Sheheryar, Néstor Becerra Yoma

doi:10.1016/j.apacoust.2021.108539

Abstract

Researchers use many objective metrics to evaluate the performance of speech separation systems. In this paper, we investigate how well widely used, state-of-the-art objective evaluation metrics correlate with subjective evaluation for different time–frequency (TF) masking based binaural speech separation algorithms, and we identify the metrics that correlate best with human perception of quality and intelligibility for such algorithms. We separate a speech source from a speech mixture with a speech separation algorithm and evaluate the quality and intelligibility of the estimated source using the objective evaluation metrics. We then conduct a subjective listening test of this estimated speech source with a large number of participants. For each algorithm, we repeat this process for 10 mixtures, compute Pearson's correlation coefficient between the objective scores and the subjective scores averaged over all participants, and calculate the statistical significance of the results. We also rank the separation algorithms by both subjective and objective testing. The results show that none of the existing objective metrics for judging the quality and intelligibility of speech separated by TF masking based binaural source separation algorithms correlates well with human perception. We then use shallow feed forward (FF) classification and regression neural networks (NNs) to enhance this correlation. The classification NN achieves almost 87% and 88% classification accuracy for the quality and intelligibility objective metrics, respectively, and the regression NN achieves 98% correlation with the average subjective quality scores.
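The correlation step described in the abstract can be illustrated with a minimal sketch. It assumes one objective score and one participant-averaged subjective score per mixture, for 10 mixtures. The function names and all numeric scores below are hypothetical placeholders, not data or code from the paper, and the t-test shown is the standard significance test for Pearson's r, which may differ from the exact procedure the authors used.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 3:
        raise ValueError("need two sequences of equal length >= 3")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r != 0, with n - 2 degrees of freedom."""
    denom = 1.0 - r * r
    if denom <= 0.0:
        # Perfect (or numerically perfect) correlation.
        return math.copysign(math.inf, r)
    return r * math.sqrt((n - 2) / denom)


# Hypothetical scores for 10 mixtures (invented for illustration only).
objective = [2.1, 2.4, 1.9, 2.8, 3.0, 2.2, 2.6, 1.7, 2.9, 2.5]
subjective_mean = [3.0, 2.6, 2.9, 3.1, 3.4, 2.2, 3.3, 2.5, 2.8, 3.2]

r = pearson_r(objective, subjective_mean)
t = t_statistic(r, len(objective))

# Two-tailed critical value of Student's t for 8 degrees of freedom at
# the 0.05 level.
T_CRITICAL_DF8 = 2.306
print(f"r = {r:.3f}, t = {t:.3f}, significant = {abs(t) > T_CRITICAL_DF8}")
```

With only 10 mixtures per algorithm, the test has 8 degrees of freedom, so a moderate correlation can still fail to reach significance. This is why the significance check matters alongside the coefficient itself.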

Full Text