Towards real-world objective speech quality and intelligibility assessment using speech-enhancement residuals and convolutional long short-term memory networks.

Xuan Dong, Donald S. Williamson

doi:10.1121/10.0002702

Abstract

Objective metrics, such as the perceptual evaluation of speech quality (PESQ), short-time objective intelligibility (STOI), and signal-to-distortion ratio (SDR), are widely used to evaluate speech. These metrics are intrusive: they require a clean reference speech signal to complete the evaluation. This requirement limits their practicality, since a clean reference signal is typically unavailable during real-world testing. In this paper, a two-stage approach is presented that estimates the scores of these intrusive metrics in a non-intrusive manner, which enables testing in real-world environments. More specifically, objective score estimation is treated as a machine-learning problem, and the use of speech-enhancement residuals and convolutional long short-term memory (SER-CL) networks is proposed to blindly estimate the objective scores (i.e., PESQ, STOI, and SDR) of various speech signals. The approach is evaluated in simulated and real environments that contain different combinations of noise and reverberation. The results show that the proposed approach is a reasonable alternative for evaluating speech, performing well in terms of both accuracy and correlation. It also outperforms comparison approaches in several environments.
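To illustrate why such metrics are called intrusive, the following minimal sketch computes a simplified energy-ratio SDR, assuming the definition 10 log10(||s||^2 / ||s - s_hat||^2). This is not necessarily the SDR variant used in the paper (evaluation toolkits often use a more elaborate definition), and the function name `simple_sdr_db` and the toy signals are hypothetical. The point is only that the clean reference `reference` is a required input, which is exactly what a non-intrusive estimator must do without.

```python
import math


def simple_sdr_db(reference, estimate):
    """Simplified energy-ratio SDR in dB (an assumed, illustrative definition).

    reference: clean signal samples (the input an intrusive metric requires)
    estimate:  degraded or processed signal samples of the same length
    """
    if len(reference) != len(estimate):
        raise ValueError("signals must have the same length")
    signal_energy = sum(s * s for s in reference)
    distortion_energy = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    if distortion_energy == 0.0:
        return math.inf  # estimate is identical to the reference
    if signal_energy == 0.0:
        return -math.inf  # silent reference, nothing to preserve
    return 10.0 * math.log10(signal_energy / distortion_energy)


# Toy data: a sine tone as the "clean" signal and an offset copy as the "degraded" one.
clean = [math.sin(2 * math.pi * 5 * n / 100) for n in range(100)]
degraded = [s + 0.1 for s in clean]

score = simple_sdr_db(clean, degraded)
print(f"SDR: {score:.2f} dB")  # about 16.99 dB for this toy example
```

Without `clean`, this score cannot be computed at all; the approach described in the abstract instead estimates such scores from the degraded signal alone.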

Full Text