Abstract
End-to-end (E2E) models, including attention-based encoder-decoder (AED) models, have achieved promising performance on the automatic speech recognition (ASR) task. However, the supervised training of E2E models requires a large amount of paired speech-text data. In contrast, self-supervised pre-training can pre-train the model on unlabeled data and then fine-tune it on limited labeled data to achieve better performance. Most previous self-supervised pre-training methods focus on learning hidden representations from speech but ignore how to utilize unpaired text. As a result, previous works often pre-train an acoustic encoder and then fine-tune it as a classification-based ASR model, such as a Connectionist Temporal Classification (CTC) based model, rather than an AED model. In this paper, we propose a self-supervised pre-training method for the AED model (SP-AED). The SP-AED method consists of acoustic pre-training for the encoder, linguistic pre-training for the decoder, and adaptive combination fine-tuning for the whole system. We first design a linguistic pre-training method for the decoder that utilizes text-only data: the decoder is pre-trained as a noise-conditioned language model to learn the prior distribution of the text. Then, we pre-train the AED encoder with the wav2vec2.0 method with some modifications. Finally, we combine the pre-trained encoder and decoder and fine-tune them on the limited labeled data. During fine-tuning, we design an adaptive combination method that modifies the decoder's input and output to prevent catastrophic forgetting. Experiments show that, compared with randomly initialized models, the SP-AED pre-trained models achieve up to 17% relative improvement, and with a similar model size or computational cost, they obtain results comparable to other classification-based models on both English and Chinese corpora.
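The overall workflow described above can be illustrated with a minimal sketch: a separately pre-trained acoustic encoder and a separately pre-trained decoder are wrapped into a single AED model and fine-tuned jointly on a small amount of paired data. This is not the authors' implementation; the module names, dimensions, and placeholder encoder/decoder below are hypothetical and stand in for an encoder pre-trained with a wav2vec2.0-style objective and a decoder pre-trained on text as a noise-conditioned language model.

```python
import torch
import torch.nn as nn

class DummyEncoder(nn.Module):
    """Placeholder for an acoustic encoder pre-trained on unlabeled speech."""
    def __init__(self, d_model=256):
        super().__init__()
        self.proj = nn.Linear(80, d_model)   # e.g. 80-dim filterbank features

    def forward(self, feats):                # feats: (B, T, 80)
        return self.proj(feats)              # (B, T, d_model)

class DummyDecoder(nn.Module):
    """Placeholder for a decoder pre-trained on unpaired text."""
    def __init__(self, vocab_size=1000, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.dec = nn.TransformerDecoder(layer, num_layers=2)

    def forward(self, tokens, memory):       # tokens: (B, U), memory: (B, T, d_model)
        return self.dec(self.embed(tokens), memory)   # (B, U, d_model)

class SPAEDSketch(nn.Module):
    """Combine the two pre-trained components into one AED model for fine-tuning."""
    def __init__(self, encoder, decoder, d_model=256, vocab_size=1000):
        super().__init__()
        self.encoder, self.decoder = encoder, decoder
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, feats, prev_tokens):
        enc = self.encoder(feats)                       # acoustic representations
        dec = self.decoder(prev_tokens, enc)            # decoder attends to encoder output
        return self.out(dec)                            # (B, U, vocab_size) logits

# Fine-tuning step on a toy batch of limited paired data.
model = SPAEDSketch(DummyEncoder(), DummyDecoder())
optim = torch.optim.Adam(model.parameters(), lr=1e-4)
feats = torch.randn(2, 100, 80)                         # acoustic features
prev = torch.randint(0, 1000, (2, 20))                  # shifted target tokens
target = torch.randint(0, 1000, (2, 20))                # reference tokens
loss = nn.functional.cross_entropy(model(feats, prev).transpose(1, 2), target)
loss.backward()
optim.step()
```

In the paper's actual method, the fine-tuning step additionally applies the adaptive combination strategy (modifying the decoder's input and output) so that the linguistic knowledge acquired during pre-training is not overwritten; that mechanism is omitted from this sketch.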