Non-intrusive quality assessment of noise-suppressed speech using unsupervised deep features

Meet H Soni, Hemant A Patil

doi:10.1016/j.specom.2021.03.004

Abstract

Objective quality assessment aims to evaluate the perceptual quality of a signal using a machine-based algorithm. Because subjective evaluation of speech quality poses several challenges, objective measures are needed. The goal of a non-intrusive quality assessment metric for noise-suppressed speech is to assess the quality of a noise-suppressed signal in the absence of any clean reference signal. As per the ITU-T P.835 recommendation, the quality assessment of noise-suppressed speech involves predicting three quality scores, namely, signal quality, background quality, and overall quality; all three are therefore considered in this study. In recent literature, non-intrusive quality assessment is posed as a regression problem, in which a perceptual model learns the mapping between a set of acoustic features and the corresponding quality scores. Recently, we proposed the use of Deep Autoencoder (DAE) features and Subband Autoencoder (SBAE) features as the acoustic representation and an Artificial Neural Network (ANN) as the regression model. DAE and SBAE are variants of the autoencoder architecture that have a bottleneck structure in the hidden layers. Such an architecture belongs to the class of generalized nonlinear Principal Component Analysis (PCA), which guarantees reconstruction of the input features with arbitrary accuracy. Both feature sets (DAE and SBAE) are extracted using unsupervised deep learning architectures, and they demonstrated better performance than the state-of-the-art spectral feature set, namely, Mel Filterbank Energies (FBEs). In this paper, we present a more detailed analysis of the previously proposed DAE and SBAE features, and analyze their usefulness in predicting the signal and background quality scores in addition to the overall quality score.
We compare the performance of all three feature sets with each other as well as with the current ITU-T P.563 metric for non-intrusive speech quality assessment. The results of our experiments on the NOIZEUS database suggest that DAE and SBAE features perform better than FBEs when predicting signal and overall quality. On the other hand, FBE features perform slightly better than DAE and SBAE features when predicting background quality. Another major contribution of this paper is that we employ a single ANN to predict all three quality scores simultaneously, and present the results. We observed that this approach predicts all three scores simultaneously with accuracy similar to that obtained when predicting them individually.
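The joint prediction described above can be illustrated with a minimal sketch: a single network with three outputs, one per quality score, trained with a mean squared error loss. This is an assumption-laden illustration and not the authors' implementation. The function names (`train_joint_regressor`, `predict`), the single tanh hidden layer, the layer sizes, the learning rate, and the synthetic features and targets are all hypothetical choices made only to show the idea of multi-output regression.

```python
import numpy as np

def train_joint_regressor(X, Y, hidden=16, lr=0.02, epochs=1000, seed=0):
    """Fit a one-hidden-layer network mapping features X (n, d) to scores Y (n, k).

    Hypothetical helper: with k = 3 the columns of Y could stand for the
    signal, background, and overall quality scores.
    """
    rng = np.random.default_rng(seed)
    d, k = X.shape[1], Y.shape[1]
    W1 = rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 1.0 / np.sqrt(hidden), size=(hidden, k))
    b2 = np.zeros(k)
    losses = []
    for _ in range(epochs):
        # Forward pass: shared hidden layer, one linear output per score.
        H = np.tanh(X @ W1 + b1)
        P = H @ W2 + b2
        E = P - Y
        losses.append(float(np.mean(E ** 2)))
        # Backward pass for the mean squared error over all outputs.
        gP = 2.0 * E / E.size
        gW2 = H.T @ gP
        gb2 = gP.sum(axis=0)
        gZ = (gP @ W2.T) * (1.0 - H ** 2)
        gW1 = X.T @ gZ
        gb1 = gZ.sum(axis=0)
        # Plain full-batch gradient descent step.
        W1 -= lr * gW1
        b1 -= lr * gb1
        W2 -= lr * gW2
        b2 -= lr * gb2
    return (W1, b1, W2, b2), losses

def predict(params, X):
    """Return an (n, k) array of predicted scores for features X."""
    W1, b1, W2, b2 = params
    return np.tanh(X @ W1 + b1) @ W2 + b2

if __name__ == "__main__":
    # Synthetic stand-ins for acoustic features and three quality scores.
    rng = np.random.default_rng(1)
    X_demo = rng.normal(size=(200, 8))
    Y_demo = 0.3 * (X_demo @ rng.normal(size=(8, 3)))
    params, losses = train_joint_regressor(X_demo, Y_demo)
    print("first and last training loss:", losses[0], losses[-1])
    print("prediction shape:", predict(params, X_demo).shape)
```

In this sketch the hidden layer is shared across the three outputs, which is what distinguishes joint prediction from training three separate single-output regressors on the same features.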

Full Text