Abstract
Automatic speech recognition (ASR) has reached human performance on many clean speech corpora, but it remains worse than human listeners in noisy environments. This paper investigates whether this gap in performance might be due to a difference in the time-frequency regions that each listener uses to make its decisions, and how these “important” regions change for ASR systems with different acoustic models (AMs) and language models (LMs). We define important regions as the time-frequency points in a spectrogram that tend to be audible when a listener correctly recognizes an utterance in noise. The evidence from this study indicates that a neural network AM attends to regions more similar to those used by humans (capturing certain high-energy regions) than a traditional Gaussian mixture model (GMM) AM does. Our analysis also shows that the neural network AM has not yet captured all of the cues that human listeners use, such as certain transitions between silence and high speech energy. We also find that differences in important time-frequency regions tend to track differences in accuracy on specific words in a test sentence, suggesting a connection between the two. Because of this connection, adapting an ASR system to attend to the same regions humans use might improve its generalization in noise.
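Concretely, the definition of importance used here can be approximated by relating per-trial audibility to per-trial recognition outcomes over many noisy presentations of the same utterance. The following is a minimal Python sketch of that idea, not the paper's exact analysis pipeline; the function name `importance_map`, the array shapes, and the use of a point-biserial correlation are illustrative assumptions.

```python
import numpy as np

def importance_map(audibility, correct):
    """Estimate a time-frequency importance map from many noisy trials.

    audibility : (n_trials, n_freq, n_time) array, nonzero where the speech
                 energy at that time-frequency point was audible above the noise.
    correct    : (n_trials,) boolean array, True when the listener (human or ASR)
                 correctly recognized the utterance on that trial.

    Returns an (n_freq, n_time) map in which points that tend to be audible on
    correct trials and masked on incorrect ones receive high values.
    """
    audibility = np.asarray(audibility, dtype=float)
    correct = np.asarray(correct, dtype=float)

    n_trials = audibility.shape[0]
    a = audibility.reshape(n_trials, -1)  # flatten the time-frequency points

    # Point-biserial correlation between audibility at each time-frequency
    # point and the binary trial-level correctness.
    a_centered = a - a.mean(axis=0)
    c_centered = correct - correct.mean()
    numerator = a_centered.T @ c_centered
    denominator = np.sqrt((a_centered ** 2).sum(axis=0) * (c_centered ** 2).sum()) + 1e-12
    corr = numerator / denominator
    return corr.reshape(audibility.shape[1:])
```

Maps estimated this way for human listeners and for each AM/LM configuration could then be compared point by point, which is one plausible way to quantify how similar an ASR system's important regions are to those of humans.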