Non-intrusive speech quality estimation as combination of estimates using multiple time-scale auditory features

Rajesh Kumar Dubey, Arun Kumar

doi:10.1016/j.dsp.2017.07.020

Abstract

The human auditory system is modeled by different auditory models that represent the distribution of speech sound energy in different channels across the cochlea using filter-banks of different bandwidths. In previous algorithms for non-intrusive speech quality evaluation, auditory features are computed with these models on a per-frame basis and then averaged over the entire speech utterance. In these approaches, the effects of impulsive noise and other non-stationary noise are averaged out over the utterance. Because speech features vary from frame to frame, a multiple time-scale features approach is proposed to capture the variation of the features over time within the utterance; it accounts for the variation of noise characteristics over the utterance and thus its effect on quality mapping. In this work, non-intrusive speech quality is estimated as an optimal linear combination of quality mappings, called objective mean opinion scores (MOS), computed using multiple time-scale estimates of features. The objective MOS for each of the multiple time-scale estimates (the combination of multiple active speeches) is obtained using a probabilistic approach. The overall objective MOS of the speech utterance is computed as the optimal linear combination of the objective MOS values estimated from the multiple time-scale feature estimates, where optimality is defined by the minimum mean square error (MMSE) criterion or the correlation maximization criterion. Results are reported in terms of Pearson's correlation coefficient and root mean square error (RMSE) between the subjective MOS and the estimated overall objective MOS for three standard databases. The results are compared with a single time-scale features approach, ITU-T Recommendation P.563, and recent algorithms.
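The combination step described in the abstract can be illustrated with a minimal sketch. It assumes that the per-time-scale objective MOS values are already available for each utterance, and it fits the combination weights by ordinary least squares, which minimizes the mean square error on the training data. The bias term, the function names, and the toy numbers are assumptions made for illustration; they are not taken from the paper, and the probabilistic per-time-scale mapping and the correlation maximization variant are not shown.

```python
import math


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left unchanged.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: estimates are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def mmse_weights(estimates, subjective):
    """Least-squares weights for combining per-time-scale objective MOS values.

    estimates:  one row per utterance, each holding K per-time-scale estimates.
    subjective: subjective MOS per utterance.
    Returns K weights followed by a bias term (the bias is an assumption here).
    """
    rows = [list(e) + [1.0] for e in estimates]  # append a constant bias column
    k = len(rows[0])
    # Normal equations: (X^T X) w = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, subjective)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def combine(weights, estimate):
    """Overall objective MOS of one utterance: weighted sum plus bias."""
    return sum(w * e for w, e in zip(weights, estimate)) + weights[-1]


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def rmse(x, y):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


if __name__ == "__main__":
    # Synthetic toy data (not from any database): 3 time scales, 6 utterances.
    estimates = [
        [2.1, 2.4, 2.0],
        [3.0, 3.3, 3.1],
        [4.2, 3.9, 4.0],
        [1.5, 1.9, 1.7],
        [3.6, 3.2, 3.8],
        [2.8, 2.5, 2.9],
    ]
    subjective = [2.2, 3.1, 4.1, 1.6, 3.5, 2.7]
    w = mmse_weights(estimates, subjective)
    overall = [combine(w, e) for e in estimates]
    print("weights and bias:", w)
    print("Pearson correlation:", pearson(subjective, overall))
    print("RMSE:", rmse(subjective, overall))
```

In practice the weights would be fitted on a training portion of a database and the correlation and RMSE reported on held-out utterances; the sketch evaluates on the training data only to keep the example short.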

Full Text