Abstract

Output-based instrumental speech quality assessment relies only on the received (processed) signal to predict quality. Such methods are called non-intrusive and are crucial in speech applications where the clean reference signal is not accessible. In this paper, we propose a new non-intrusive instrumental quality measure based on the similarity between two i-vectors. Because the clean reference signal is unavailable, its i-vector representation cannot be extracted directly. We therefore use a clean speech Gaussian mixture model (GMM) to estimate the clean speech spectrum from its degraded counterpart. The i-vector representations of the estimated clean speech and of the degraded speech are then extracted, and either the cosine or the Euclidean similarity between them is computed as a correlate of speech quality. The clean speech model is trained on RASTA-filtered mel-frequency cepstral coefficients extracted from a pool of clean speech files, so that it captures the spectral characteristics of clean speech. The proposed method is evaluated under noisy, reverberant, and enhanced speech conditions. Experimental results show that the proposed system achieves higher correlations with perceptual speech quality than several benchmark non-intrusive measures, especially for noisy and enhanced speech.
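The final scoring step can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes both i-vectors (one from the GMM-estimated clean features, one from the degraded features) have already been extracted and are equal-length lists of floats. The function names are hypothetical. The abstract does not specify how a similarity value is mapped onto a quality score. Presumably a higher cosine similarity, or a smaller Euclidean distance, indicates higher quality.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two i-vectors (1.0 means same direction).

    Hypothetical helper: compares the i-vector of the estimated clean speech
    with the i-vector of the degraded speech.
    """
    if len(a) != len(b):
        raise ValueError("i-vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two i-vectors (0.0 means identical).

    Note: this is a distance, so smaller values mean the vectors are MORE
    similar; any conversion to a similarity score is an assumption left open.
    """
    if len(a) != len(b):
        raise ValueError("i-vectors must have the same dimension")
    return math.dist(a, b)  # math.dist requires Python 3.8+


if __name__ == "__main__":
    # Toy 2-dimensional vectors; real i-vectors are typically much longer.
    w_clean_est = [3.0, 4.0]
    w_degraded = [4.0, 3.0]
    print("cosine:", cosine_similarity(w_clean_est, w_degraded))
    print("euclidean:", euclidean_distance(w_clean_est, w_degraded))
```

Cosine similarity ignores the vector norms and compares only directions, whereas the Euclidean distance is also sensitive to differences in magnitude. This may be one reason to evaluate both.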
