Abstract

This article describes an efficient training method for online streaming attention-based encoder-decoder (AED) automatic speech recognition (ASR) systems. AED models have achieved competitive performance in offline scenarios by jointly optimizing all components. They have recently been extended to an online streaming framework via models such as monotonic chunkwise attention (MoChA). However, the elaborate attention calculation process is not robust to long-form speech utterances. Moreover, the sequence-level training objective and time-restricted streaming encoder cause a nonnegligible delay in token emission during inference. To address these problems, we propose CTC synchronous training (CTC-ST), in which CTC alignments are leveraged as a reference for token boundaries to enable a MoChA model to learn optimal monotonic input-output alignments. We formulate a purely end-to-end training objective that synchronizes the boundaries of MoChA with those of CTC. The CTC model shares an encoder with the MoChA model to enhance the encoder representation, and the proposed method provides the alignment information learned in the CTC branch to the attention-based decoder. CTC-ST can therefore be regarded as self-distillation of alignment knowledge from CTC to MoChA. Experimental evaluations on a variety of benchmark datasets show that the proposed method significantly reduces recognition errors and emission latency simultaneously, especially for long-form and noisy speech. We also compare CTC-ST with several methods that distill alignment knowledge from a hybrid ASR system and show that CTC-ST achieves a comparable tradeoff between accuracy and latency without relying on external alignment information. The best MoChA system shows performance comparable to that of an RNN transducer (RNN-T).
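
To make the training objective concrete, the following PyTorch-style sketch combines the attention and CTC losses computed over the shared encoder with the boundary-synchronization term. This is a minimal illustration under stated assumptions: the helper names, the tensor shapes, and the loss weights (ctc_weight, sync_weight) are hypothetical choices, not the authors' exact formulation.

    import torch

    def synchronization_loss(mocha_boundaries: torch.Tensor,
                             ctc_boundaries: torch.Tensor) -> torch.Tensor:
        # Mean absolute frame distance between each token's expected MoChA
        # boundary (real-valued) and its reference CTC boundary (a frame
        # index), both of shape (batch, num_tokens).
        return (mocha_boundaries - ctc_boundaries.float()).abs().mean()

    def ctc_st_objective(loss_att: torch.Tensor,
                         loss_ctc: torch.Tensor,
                         loss_sync: torch.Tensor,
                         ctc_weight: float = 0.3,
                         sync_weight: float = 1.0) -> torch.Tensor:
        # Multi-task combination over the shared encoder: the usual
        # attention/CTC interpolation plus the synchronization term that
        # distills CTC alignment knowledge into the MoChA decoder.
        return ((1.0 - ctc_weight) * loss_att
                + ctc_weight * loss_ctc
                + sync_weight * loss_sync)

Because the synchronization term depends only on quantities already produced during joint CTC/attention training, no external aligner or per-frame supervision is required, which is what makes the objective purely end-to-end.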

Highlights

  • Online streaming automatic speech recognition (ASR) is a core technology for speech applications such as live captioning, simultaneous translation, voice search, and dialogue systems

  • We demonstrate that connectionist temporal classification (CTC) synchronous training (CTC-ST) can reduce emission latency without external alignment information and achieve an accuracy-latency tradeoff comparable to that of alignment knowledge distillation from a hybrid system [32]

  • Although we focus on monotonic chunkwise attention (MoChA) in this work, the proposed method can be applied to any attention-based encoder-decoder (AED) model that calculates attention scores (see the sketch after this list)
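
The quantities such a model needs are easy to obtain from its own outputs. As a minimal, assumed sketch (not the paper's exact procedure): the expected boundary of each token can be read off the monotonic attention distribution, and a reference boundary can be taken from the CTC alignment, here approximated by the greedy CTC path rather than the most probable alignment used in the paper.

    import torch

    def expected_boundaries(alpha: torch.Tensor) -> torch.Tensor:
        # Expected boundary frame per token from a monotonic attention
        # distribution alpha of shape (num_tokens, num_frames):
        # b_i = sum_j j * alpha[i, j].
        frames = torch.arange(alpha.size(-1), dtype=alpha.dtype)
        return (alpha * frames).sum(dim=-1)

    def greedy_ctc_boundaries(log_probs: torch.Tensor,
                              blank: int = 0) -> torch.Tensor:
        # First-emission frame of each non-blank token on the greedy CTC
        # path; log_probs has shape (num_frames, vocab_size). A simple
        # stand-in for CTC forced alignment.
        path = log_probs.argmax(dim=-1).tolist()
        boundaries, prev = [], blank
        for t, sym in enumerate(path):
            if sym != blank and sym != prev:
                boundaries.append(t)
            prev = sym
        return torch.tensor(boundaries)

Since both functions use only attention scores and CTC posteriors, the same recipe transfers to any AED model that exposes an attention distribution over encoder frames.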



Introduction

Online streaming automatic speech recognition (ASR) is a core technology for speech applications such as live captioning, simultaneous translation, voice search, and dialogue systems. Representative approaches include the connectionist temporal classification (CTC) [4], recurrent neural network transducer (RNN-T) [5], recurrent neural aligner (RNA) [6], hybrid autoregressive transducer (HAT) [7], and attention-based encoder-decoder (AED) [8], [9] models. AED models are not directly suitable for streaming because they require the entire input in order to generate the initial token. RNN-T has been a practical choice because it outperforms CTC with the help of token dependency modeling in the prediction network [3], [13], [14]. However, RNN-T is known to consume significant memory during training [15], [16] and to require a large search space during inference because of its frame-wise prediction, which significantly slows down decoding.

