Abstract

In vision Transformers, attention visualization methods are used to generate heatmaps highlighting the class-corresponding areas in input images, which offers explanations on how the models make predictions. However, it is not so applicable for explaining automatic speech recognition (ASR) Transformers. An ASR Transformer makes a particular prediction for every input token to form a sentence, but a vision Transformer only makes an overall classification for the input data. Therefore, traditional attention visualization methods may fail in ASR Transformers. In this work, we propose a novel attention visualization method in ASR Transformers and try to explain which frames of the audio result in the output text. Inspired by the model explainability, we also explore ways of improving the effectiveness of the ASR model. Comparing with other Transformer attention visualization methods, our method is more efficient and intuitively understandable, which unravels the attention calculation from information flow of Transformer attention modules. In addition, we demonstrate the utilization of visualization result in three ways: (1) We visualize attention with respect to connectionist temporal classification (CTC) loss to train an ASR model with adversarial attention erasing regularization, which effectively decreases the word error rate (WER) of the model and improves its generalization capability. (2) We visualize the attention on some specific words, interpreting the model by effectively demonstrating the semantic and grammar relationships between these words. (3) Similarly, we analyze how the model manage to distinguish homophones, using contrastive explanation with respect to homophones.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.