Accurate objective prediction of human speech intelligibility is of interest for many applications, such as the evaluation of signal processing algorithms. To predict the speech recognition threshold (SRT) of normal-hearing listeners, an automatic speech recognition (ASR) system is employed that uses a deep neural network (DNN) to convert the acoustic input into phoneme predictions, which are subsequently decoded into word transcripts. ASR results are obtained with, and compared to, the data presented in Schubotz et al. (2016), which comprise eight different additive maskers, ranging from speech-shaped stationary noise to a single-talker interferer, and responses from eight normal-hearing subjects. The task for both listeners and ASR is to identify words from a German matrix sentence test presented in noise under monaural conditions. Two ASR training schemes typically used in applications are considered: (A) matched training, which uses the same noise type for training and testing, and (B) multi-condition training, which covers all eight maskers. For both training schemes, ASR-based predictions outperform established measures such as the extended speech intelligibility index (ESII), the multi-resolution speech envelope power spectrum model (mr-sEPSM) and others. This result is obtained with a speaker-independent model that compares the word labels of the utterance with the ASR transcript; this approach does not require separate noise and speech signals. The best predictions are obtained for multi-condition training with amplitude modulation features, which implies that the noise type has been seen during training. Predictions and measurements are analyzed by comparing speech recognition thresholds and individual psychometric functions to the DNN-based results.
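As a rough illustration of how an SRT could be derived from ASR transcripts, the following Python sketch scores a transcript against the word labels of an utterance and interpolates the signal-to-noise ratio (SNR) at which a target recognition rate is reached. It is not the authors' implementation. The function names `word_recognition_rate` and `estimate_srt`, the position-wise word scoring, the 50%-correct definition of the SRT, and the linear interpolation between measured SNRs are all assumptions made for this example.

```python
from typing import Sequence


def word_recognition_rate(reference: Sequence[str], transcript: Sequence[str]) -> float:
    """Fraction of reference words matched at the same position in the transcript.

    Assumption: matrix sentences have a fixed word order, so words are
    compared position by position; missing words count as errors.
    """
    if not reference:
        raise ValueError("reference must contain at least one word")
    correct = sum(1 for ref, hyp in zip(reference, transcript) if ref == hyp)
    return correct / len(reference)


def estimate_srt(snrs_db: Sequence[float], rates: Sequence[float], target: float = 0.5) -> float:
    """Linearly interpolate the SNR (in dB) at which the recognition rate reaches `target`.

    Assumption: the SRT is taken as the SNR at 50% word recognition, and the
    psychometric function is approximated piecewise-linearly between points.
    """
    points = sorted(zip(snrs_db, rates))
    for (s0, r0), (s1, r1) in zip(points, points[1:]):
        # Use the first pair of neighbouring SNRs that brackets the target rate.
        if r0 <= target <= r1 and r1 > r0:
            return s0 + (target - r0) * (s1 - s0) / (r1 - r0)
    raise ValueError("target rate is not bracketed by the measurements")


# Hypothetical example data, for illustration only.
rate = word_recognition_rate(
    ["peter", "kauft", "drei", "nasse", "autos"],
    ["peter", "kauft", "zwei", "nasse", "autos"],
)
srt = estimate_srt([-15.0, -10.0, -5.0, 0.0], [0.05, 0.2, 0.6, 0.95])
```

In practice, a psychometric function (for example a logistic curve) would typically be fitted to the recognition rates rather than interpolated linearly; the piecewise-linear version is used here only to keep the sketch short and dependency-free.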