Abstract
An accurate objective prediction of human speech intelligibility is of interest for many applications such as the evaluation of signal processing algorithms. To predict the speech recognition threshold (SRT) of normal-hearing listeners, an automatic speech recognition (ASR) system is employed that uses a deep neural network (DNN) to convert the acoustic input into phoneme predictions, which are subsequently decoded into word transcripts. ASR results are obtained with and compared to data presented in Schubotz et al. (2016), which comprises eight different additive maskers that range from speech-shaped stationary noise to a single-talker interferer and responses from eight normal-hearing subjects. The task for listeners and ASR is to identify noisy words from a German matrix sentence test in monaural conditions. Two ASR training schemes typically used in applications are considered: (A) matched training, which uses the same noise type for training and testing and (B) multi-condition training, which covers all eight maskers. For both training schemes, ASR-based predictions outperform established measures such as the extended speech intelligibility index (ESII), the multi-resolution speech envelope power spectrum model (mr-sEPSM) and others. This result is obtained with a speaker-independent model that compares the word labels of the utterance with the ASR transcript, which does not require separate noise and speech signals. The best predictions are obtained for multi-condition training with amplitude modulation features, which implies that the noise type has been seen during training. Predictions and measurements are analyzed by comparing speech recognition thresholds and individual psychometric functions to the DNN-based results.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.