Speech enhancement (SE) is a pivotal technology in enhancing the quality and intelligibility of speech signals. Nevertheless, when processing speech signals under conditions of high signal-to-noise ratio (SNR), conventional SE techniques may inadvertently lead to a diminution in the perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI). This article introduces the innovative incorporation of the Non-Intrusive Speech Quality Assessment (NISQA) algorithm into SE systems. Through the comparison of pre and post-enhancement speech quality scores, it discerns whether the speech signal under consideration warrants enhancement processing, thereby mitigating potential deterioration in PESQ and STOI. Furthermore, this study delves into the ramifications of five prevalent speech features, namely, Mel Frequency Cepstral Coefficients (MFCC), Gammatone Frequency Cepstral Coefficients (GFCC), Relative Spectral Trans-formed Perceptual Linear Prediction coefficients (RASTA-PLP), Amplitude Modulation Spectrogram (AMS), and Multi-Resolution Cochleagram (MRCG), on PESQ and STOI under varying noise conditions. Experimental outcomes underscore that MRCG consistently emerges as the optimal and most stable feature for STOI, while the feature yielding the highest PESQ score exhibits intricate correlations with the background noise type, SNR level, and noise compatibility with the speech signal. Consequently, we propose an SE methodology founded on quality assessment and feature selection, facilitating the adaptive selection of optimal features tailored to distinct background noise scenarios, thereby always maintain the highest caliber enhancement effect with regard to PESQ metrics.
Read full abstract