Abstract

Although attention-based speech recognition has achieved promising performance, the intermediate representations it learns remain a black box. In this paper, we visualize and explain the continuous encoder outputs. We propose a human-intervened forced alignment method to obtain labels for t-distributed stochastic neighbor embedding (t-SNE), and use them to better understand the attention mechanism and the recurrent representations. In addition, we combine t-SNE and canonical correlation analysis (CCA) to analyze the training dynamics of phones in the attention-based model. Experiments are carried out on TIMIT and WSJ. The aligned embeddings of the encoder outputs form sequence manifolds of the ground-truth labels. The t-SNE figures visually show what representations the encoder has shaped and how the attention mechanism works for speech recognition. Comparisons across different models, layers, and utterance lengths show that the manifolds have clearer shapes when the outputs come from deeper encoder layers, shorter utterances, and better-performing models. We also observe that the same symbols from different utterances tend to gather at similar positions, which supports the consistency of our method. Further comparisons are made between different epochs of the model using t-SNE and CCA. The results show that plosive and nasal/flap phones converge quickly, while long vowel phones converge slowly.
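A minimal sketch of the labeled t-SNE visualization described above: frame-level encoder outputs are projected to 2-D and colored by the phone label assigned by a forced alignment. The inputs `encoder_outputs` and `phone_labels` are hypothetical stand-ins; the paper's own alignment procedure and model code are not reproduced here.

```python
# Sketch: t-SNE of frame-level encoder outputs, colored by aligned phone label.
# Assumes encoder_outputs is (frames x dim) and phone_labels gives one phone id
# per frame from a forced alignment (both hypothetical inputs).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_encoder_embedding(encoder_outputs, phone_labels, perplexity=30):
    """Project encoder outputs to 2-D with t-SNE and scatter them per phone."""
    emb = TSNE(n_components=2, perplexity=perplexity,
               init="pca", random_state=0).fit_transform(encoder_outputs)
    for phone in np.unique(phone_labels):
        mask = phone_labels == phone
        plt.scatter(emb[mask, 0], emb[mask, 1], s=4, label=str(phone))
    plt.legend(fontsize=6, markerscale=2)
    plt.title("t-SNE of encoder outputs, colored by aligned phone")
    plt.show()

if __name__ == "__main__":
    # Random stand-in data: 600 frames, 256-dim encoder states, 5 phone classes.
    rng = np.random.default_rng(0)
    plot_encoder_embedding(rng.normal(size=(600, 256)),
                           rng.integers(0, 5, size=600))
```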

Highlights

  • The traditional techniques separate a speech recognition system into a variety of modules

  • As for the attention-based model, we would like to explore the outputs from the intermediate layers of the recurrent neural network (RNN) and why the attention mechanism works for speech recognition

  • The attention-based model has difficulty decoding long sentences [38]–[40]; we further demonstrate that it also has difficulty modeling phones with long tones in speech recognition


Summary

INTRODUCTION

The traditional techniques separate a speech recognition system into a variety of modules. The CTC model inserts extra blank symbols to make the length of the outputs consistent with the input sequences. Although it is optimized at the sequence level, it is still a variant of the Markov model and relies on an independence assumption [8]. Along with the transducer [13], [14], the attention layer is a basic structure in much end-to-end research; it is more interpretable than other methods and achieves rather promising results [5], [15], [16]. As for the attention-based model, we would like to explore the outputs from the intermediate layers of the recurrent neural network (RNN) and why the attention mechanism works for speech recognition.
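To make the CCA-based analysis of training dynamics concrete, the sketch below compares the encoder outputs for the same frames (e.g., all frames aligned to one phone) at two different checkpoints and reports the mean canonical correlation; a phone whose similarity to the final epoch rises quickly is the one converging fast. This uses scikit-learn's CCA as a stand-in and does not claim to reproduce the exact CCA variant used in the paper; all names and shapes are hypothetical.

```python
# Sketch: per-phone CCA similarity between two training checkpoints.
# reps_epoch_a and reps_epoch_b are (frames x dim) encoder outputs for the
# same frames at two epochs (hypothetical inputs).
import numpy as np
from sklearn.cross_decomposition import CCA

def mean_cca_similarity(reps_epoch_a, reps_epoch_b, n_components=10):
    """Mean canonical correlation between two views of the same frames."""
    cca = CCA(n_components=n_components, max_iter=1000)
    a_c, b_c = cca.fit_transform(reps_epoch_a, reps_epoch_b)
    corrs = [np.corrcoef(a_c[:, i], b_c[:, i])[0, 1]
             for i in range(n_components)]
    return float(np.mean(corrs))

if __name__ == "__main__":
    # Hypothetical data: frames aligned to one phone at an early and a final epoch.
    rng = np.random.default_rng(0)
    early = rng.normal(size=(400, 128))
    final = early @ rng.normal(size=(128, 128))  # a linear transform of the early view
    print("mean CCA correlation:", mean_cca_similarity(early, final))
```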

RELATED WORKS
ANALYZING PHONE DYNAMICS USING CCA
DESCRIPTIONS OF SPEECH RECOGNITION MODELS
Findings
CONCLUSION