Abstract
End-to-end (E2E) models, including attention-based encoder-decoder (AED) models, have achieved promising performance on the automatic speech recognition (ASR) task. However, the supervised training of E2E models requires a large amount of paired speech-text data. In contrast, self-supervised pre-training can pre-train the model on unlabeled data and then fine-tune it on limited labeled data to achieve better performance. Most previous self-supervised pre-training methods focus on learning hidden representations from speech but ignore how to utilize unpaired text. As a result, previous works often pre-train an acoustic encoder and then fine-tune it as a classification-based ASR model, such as a Connectionist Temporal Classification (CTC) based model, rather than an AED model. In this paper, we propose a self-supervised pre-training method for the AED model (SP-AED). The SP-AED method consists of acoustic pre-training for the encoder, linguistic pre-training for the decoder, and adaptive combination fine-tuning for the whole system. We first design a linguistic pre-training method for the decoder that utilizes text-only data: the decoder is pre-trained as a noise-conditioned language model to learn the prior distribution of the text. Then, we pre-train the AED encoder with the wav2vec2.0 method with some modifications. Finally, we combine the pre-trained encoder and decoder and fine-tune them on the limited labeled data. During fine-tuning, we design an adaptive combination method that modifies the decoder's input and output to prevent catastrophic forgetting. Experiments show that, compared with randomly initialized models, the SP-AED pre-trained models achieve up to 17% relative improvement, and with a similar model size or computational cost, they obtain results comparable to other classification-based models on both English and Chinese corpora.
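The overall workflow described above can be illustrated with a minimal sketch: a separately pre-trained acoustic encoder and a separately pre-trained decoder are wrapped into a single AED model and fine-tuned jointly on a small amount of paired data. This is not the authors' implementation; the module names, dimensions, and placeholder encoder/decoder below are hypothetical and stand in for an encoder pre-trained with a wav2vec2.0-style objective and a decoder pre-trained on text as a noise-conditioned language model.

```python
import torch
import torch.nn as nn

class DummyEncoder(nn.Module):
    """Placeholder for an acoustic encoder pre-trained on unlabeled speech."""
    def __init__(self, d_model=256):
        super().__init__()
        self.proj = nn.Linear(80, d_model)   # e.g. 80-dim filterbank features

    def forward(self, feats):                # feats: (B, T, 80)
        return self.proj(feats)              # (B, T, d_model)

class DummyDecoder(nn.Module):
    """Placeholder for a decoder pre-trained on unpaired text."""
    def __init__(self, vocab_size=1000, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.dec = nn.TransformerDecoder(layer, num_layers=2)

    def forward(self, tokens, memory):       # tokens: (B, U), memory: (B, T, d_model)
        return self.dec(self.embed(tokens), memory)   # (B, U, d_model)

class SPAEDSketch(nn.Module):
    """Combine the two pre-trained components into one AED model for fine-tuning."""
    def __init__(self, encoder, decoder, d_model=256, vocab_size=1000):
        super().__init__()
        self.encoder, self.decoder = encoder, decoder
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, feats, prev_tokens):
        enc = self.encoder(feats)                       # acoustic representations
        dec = self.decoder(prev_tokens, enc)            # decoder attends to encoder output
        return self.out(dec)                            # (B, U, vocab_size) logits

# Fine-tuning step on a toy batch of limited paired data.
model = SPAEDSketch(DummyEncoder(), DummyDecoder())
optim = torch.optim.Adam(model.parameters(), lr=1e-4)
feats = torch.randn(2, 100, 80)                         # acoustic features
prev = torch.randint(0, 1000, (2, 20))                  # shifted target tokens
target = torch.randint(0, 1000, (2, 20))                # reference tokens
loss = nn.functional.cross_entropy(model(feats, prev).transpose(1, 2), target)
loss.backward()
optim.step()
```

In the paper's actual method, the fine-tuning step additionally applies the adaptive combination strategy (modifying the decoder's input and output) so that the linguistic knowledge acquired during pre-training is not overwritten; that mechanism is omitted from this sketch.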