Abstract

Many speech enhancement methods require perceptual quality metrics for evaluation. The gold standard for perceptual speech quality assessment is human subjective rating, summarized as the mean opinion score (MOS). However, acquiring human ratings is time-consuming, laborious, and expensive. Existing objective quality metrics, on the other hand, are efficient and easy to compute but do not correlate well with human ratings. In this paper, we propose a relatively lightweight deep-learning-based model that predicts human ratings of speech signals. Because the model is differentiable, it can readily be used as a perceptual regularizer to improve existing deep-learning-based speech enhancement methods. Experimental results demonstrate that the predictions of our proposed model correlate well with human judgments. We present an application to speech enhancement and show that, although performance degrades on traditional objective metrics, the perceptual quality and naturalness of the enhanced speech improve significantly.
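
As a rough illustration of the regularization idea, the sketch below adds a penalty based on a predicted mean opinion score (MOS) to an ordinary signal-level loss. It is not the model or loss from the paper: `toy_predict_mos`, `regularized_loss`, the `weight` parameter, and the form of the penalty are all hypothetical, and the 1 to 5 range is assumed from the usual MOS rating scale. In practice the predictor would be a trained, differentiable network and the loss would be written in an automatic-differentiation framework so that gradients reach the enhancement model.

```python
from typing import Callable, Sequence

MAX_MOS = 5.0  # assumed top of the usual 1-5 MOS rating scale


def mse(estimate: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between the enhanced and the clean signal."""
    return sum((e - t) ** 2 for e, t in zip(estimate, target)) / len(target)


def toy_predict_mos(signal: Sequence[float]) -> float:
    """Hypothetical stand-in for a learned quality predictor.

    NOT the paper's model: it just maps mean energy to a score in (1, 5]
    so that the example runs without any trained weights.
    """
    energy = sum(x * x for x in signal) / len(signal)
    return 1.0 + 4.0 / (1.0 + energy)


def regularized_loss(
    estimate: Sequence[float],
    target: Sequence[float],
    predict_mos: Callable[[Sequence[float]], float],
    weight: float = 0.1,
) -> float:
    """Signal loss plus a penalty that grows as predicted quality drops."""
    perceptual_penalty = MAX_MOS - predict_mos(estimate)
    return mse(estimate, target) + weight * perceptual_penalty
```

The `weight` term controls the trade-off the abstract describes: a larger value favors predicted perceptual quality over the signal-level objective.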

