Abstract

In non-intrusive speech quality assessment, the original clean speech signal is not available as a reference; quality is estimated from the received degraded speech alone. The processing and perception of speech signals by the human auditory system are captured by perceptual linear prediction (PLP) coefficients and Mel-frequency cepstral coefficients (MFCC). Line spectral frequency (LSF) features carry intrinsic information about the formant structure of phonemes, which is related to the resonance frequencies of the speaker's vocal tract during articulation. The combined PLP, MFCC and LSF features, together with the subjective mean opinion score (MOS) of each speech utterance, are used to train a joint Gaussian mixture model (GMM) with the expectation-maximization (EM) algorithm. The resulting joint GMM parameters and the combined PLP, MFCC and LSF features of an utterance are then used to estimate its objective MOS. The correlation between the subjective MOS and the estimated objective MOS serves as the figure of merit for the speech quality assessment algorithm. To show the efficacy of the method, its results, in terms of correlation and root mean square error (RMSE) between the subjective and the estimated objective MOS, are compared with those of ITU-T Recommendation P.563, the standard for non-intrusive speech quality assessment, on the ITU-T Supplement 23, NOIZEUS-960 and NOIZEUS-2240 databases.
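
The estimation step can be illustrated with a minimal sketch. The abstract does not state the estimator, so this assumes the common choice for joint GMMs: the minimum mean square error estimate, that is, the conditional expectation of the MOS given the feature vector. The function names, the two-dimensional stand-in for the combined PLP, MFCC and LSF features, and all parameter values are hypothetical; in practice the weights, means and covariances would come from EM training, which is not shown here.

```python
import numpy as np


def gaussian_logpdf(x, mean, cov):
    """Log density of a multivariate normal evaluated at x."""
    d = x.shape[0]
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)


def estimate_mos(x, weights, means, covs):
    """Assumed estimator: E[y | x] under a joint GMM over z = [x, y].

    x is the feature vector and y (the MOS) is the last dimension of z.
    """
    x = np.asarray(x, dtype=float)
    d = x.shape[0]
    n_comp = len(weights)
    log_resp = np.empty(n_comp)
    cond_means = np.empty(n_comp)
    for k, (w, mu, cov) in enumerate(zip(weights, means, covs)):
        mu_x, mu_y = mu[:d], mu[d]
        cov_xx, cov_yx = cov[:d, :d], cov[d, :d]
        # Unnormalized log posterior of component k given the features only.
        log_resp[k] = np.log(w) + gaussian_logpdf(x, mu_x, cov_xx)
        # Conditional mean of y given x within component k.
        cond_means[k] = mu_y + cov_yx @ np.linalg.solve(cov_xx, x - mu_x)
    # Normalize the posteriors in a numerically stable way.
    resp = np.exp(log_resp - log_resp.max())
    resp /= resp.sum()
    return float(resp @ cond_means)


def correlation_and_rmse(subjective, objective):
    """Pearson correlation and RMSE between subjective and objective MOS."""
    s = np.asarray(subjective, dtype=float)
    o = np.asarray(objective, dtype=float)
    r = np.corrcoef(s, o)[0, 1]
    rmse = np.sqrt(np.mean((s - o) ** 2))
    return float(r), float(rmse)


# Hypothetical toy parameters: 2 feature dimensions + 1 MOS dimension,
# 2 mixture components. Real values would come from EM training.
toy_cov = np.array([[1.0, 0.0, 0.5],
                    [0.0, 1.0, 0.0],
                    [0.5, 0.0, 1.0]])
weights = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0, 2.0],
                  [3.0, 3.0, 4.0]])
covs = np.array([toy_cov, toy_cov])

estimated = estimate_mos([0.0, 0.0], weights, means, covs)
```

Given subjective scores and estimates for a set of utterances, `correlation_and_rmse` returns the two figures of merit named in the abstract. The per-component posterior is computed from the feature marginal only, since the MOS is unknown at estimation time.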
