Abstract

Automatic speech recognition (ASR) has reached human performance on many clean speech corpora, but it still performs worse than human listeners in noisy environments. This paper investigates whether this performance gap might be due to a difference in the time-frequency regions that each listener uses in making decisions, and how these “important” regions change for ASRs using different acoustic models (AMs) and language models (LMs). We define important regions as time-frequency points in a spectrogram that tend to be audible when the listener correctly recognizes an utterance in noise. The evidence from this study indicates that a neural network AM attends to regions more similar to those used by humans (capturing certain high-energy regions) than a traditional Gaussian mixture model (GMM) AM does. Our analysis also shows that the neural network AM has not yet captured all the cues that human listeners exploit, such as certain transitions between silence and high speech energy. We also find that differences in important time-frequency regions tend to track differences in accuracy on specific words in a test sentence, suggesting a connection between the two. Given this connection, adapting an ASR to attend to the same regions humans use might improve its generalization in noise.

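To make the definition of “important” regions concrete, here is a minimal sketch of how such an importance map could be estimated from many noisy presentations of one utterance. It assumes each trial provides a binary audibility mask over the spectrogram and a correct/incorrect recognition outcome; the function name, the mask representation, and the difference-of-means statistic are illustrative assumptions, not the paper's exact procedure, which may use a different audibility measure or a correlation-based analysis.

```python
import numpy as np


def importance_map(audibility_masks, correct):
    """Estimate a time-frequency importance map for one utterance.

    audibility_masks: array of shape (n_trials, n_freq, n_time), where each
        entry is 1 if the speech was audible above the noise at that
        time-frequency point on that trial, 0 otherwise (hypothetical format).
    correct: boolean array of shape (n_trials,), True where the listener
        (human or ASR) recognized the utterance correctly on that trial.

    Returns an (n_freq, n_time) map in which points that tend to be audible
    on correct trials but not on incorrect ones receive high scores.
    """
    masks = np.asarray(audibility_masks, dtype=float)
    correct = np.asarray(correct, dtype=bool)

    # Mean audibility of each time-frequency point, split by recognition outcome.
    audible_when_correct = masks[correct].mean(axis=0)
    audible_when_wrong = masks[~correct].mean(axis=0)

    # A point is "important" if it is audible more often when recognition succeeds.
    return audible_when_correct - audible_when_wrong


# Toy usage: 200 noisy presentations, 64 frequency bands, 100 time frames.
rng = np.random.default_rng(0)
masks = rng.integers(0, 2, size=(200, 64, 100))
correct = rng.random(200) > 0.5
imp = importance_map(masks, correct)
print(imp.shape)  # (64, 100)
```

Comparing such maps computed from human responses against those computed from ASR outputs (with different AMs or LMs standing in as the “listener”) is one way the overlap and the differences described above could be quantified.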