Abstract

Automatic speech recognition (ASR) has reached human performance on many clean speech corpora, but it still performs worse than human listeners in noisy environments. This paper investigates whether this performance gap might be due to a difference in the time-frequency regions that each listener uses in making decisions, and how these “important” regions change for ASRs using different acoustic models (AMs) and language models (LMs). We define important regions as time-frequency points in a spectrogram that tend to be audible when the listener correctly recognizes an utterance in noise. The evidence from this study indicates that a neural network AM attends to regions more similar to those used by humans (capturing certain high-energy regions) than a traditional Gaussian mixture model (GMM) AM does. Our analysis also shows that the neural network AM has not yet captured all the cues that human listeners exploit, such as certain transitions between silence and high speech energy. We also find that differences in important time-frequency regions tend to track differences in accuracy on specific words in a test sentence, suggesting a connection between the two. Given this connection, adapting an ASR to attend to the same regions humans use might improve its generalization in noise.

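To make the definition of “important” regions concrete, here is a minimal sketch of how such an importance map could be estimated from many noisy presentations of one utterance. It assumes each trial provides a binary audibility mask over the spectrogram and a correct/incorrect recognition outcome; the function name, the mask representation, and the difference-of-means statistic are illustrative assumptions, not the paper's exact procedure, which may use a different audibility measure or a correlation-based analysis.

```python
import numpy as np


def importance_map(audibility_masks, correct):
    """Estimate a time-frequency importance map for one utterance.

    audibility_masks: array of shape (n_trials, n_freq, n_time), where each
        entry is 1 if the speech was audible above the noise at that
        time-frequency point on that trial, 0 otherwise (hypothetical format).
    correct: boolean array of shape (n_trials,), True where the listener
        (human or ASR) recognized the utterance correctly on that trial.

    Returns an (n_freq, n_time) map in which points that tend to be audible
    on correct trials but not on incorrect ones receive high scores.
    """
    masks = np.asarray(audibility_masks, dtype=float)
    correct = np.asarray(correct, dtype=bool)

    # Mean audibility of each time-frequency point, split by recognition outcome.
    audible_when_correct = masks[correct].mean(axis=0)
    audible_when_wrong = masks[~correct].mean(axis=0)

    # A point is "important" if it is audible more often when recognition succeeds.
    return audible_when_correct - audible_when_wrong


# Toy usage: 200 noisy presentations, 64 frequency bands, 100 time frames.
rng = np.random.default_rng(0)
masks = rng.integers(0, 2, size=(200, 64, 100))
correct = rng.random(200) > 0.5
imp = importance_map(masks, correct)
print(imp.shape)  # (64, 100)
```

Comparing such maps computed from human responses against those computed from ASR outputs (with different AMs or LMs standing in as the “listener”) is one way the overlap and the differences described above could be quantified.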