Efficiently Fusing Pretrained Acoustic and Linguistic Encoders for Low-Resource Speech Recognition

Cheng Yi, Bo Xu, Shiyu Zhou

doi:10.1109/lsp.2021.3071668

Abstract

End-to-end models have achieved impressive results in automatic speech recognition (ASR). For low-resource ASR tasks, however, labeled data can hardly satisfy the demands of end-to-end models. Self-supervised acoustic pre-training has already shown strong ASR performance, but the available transcriptions remain inadequate for language modeling in end-to-end models. In this work, we fuse a pre-trained acoustic encoder (wav2vec2.0) and a pre-trained linguistic encoder (BERT) into an end-to-end ASR model. The fused model only needs to learn the transfer from speech to language during fine-tuning on limited labeled data. The lengths of the two modalities are matched by a monotonic attention mechanism that introduces no additional parameters. In addition, a fully connected layer is introduced to map hidden representations between the modalities. We further propose a scheduled fine-tuning strategy to preserve and utilize the text context modeling ability of the pre-trained linguistic encoder. Experiments show that our model makes effective use of the pre-trained modules. It achieves better recognition performance on the CALLHOME corpus (15 hours) than other end-to-end models.
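
The abstract does not say how the parameter-free monotonic attention matches the two sequence lengths. As an illustration only, the sketch below assumes an integrate-and-fire style aggregation: each acoustic frame carries a weight in [0, 1], weights are accumulated from left to right, and one token-level vector is emitted each time the running sum reaches a threshold. The function name `integrate_and_fire`, the threshold of 1.0, and the way the weights are obtained are all assumptions for this example, not details taken from the paper.

```python
from typing import List


def integrate_and_fire(
    frames: List[List[float]],
    weights: List[float],
    threshold: float = 1.0,
) -> List[List[float]]:
    """Aggregate frame vectors into token vectors, strictly left to right.

    Hypothetical sketch: each weight is assumed to lie in [0, threshold],
    so a single frame can close at most one token.
    """
    dim = len(frames[0]) if frames else 0
    tokens: List[List[float]] = []
    acc_weight = 0.0          # weight accumulated for the current token
    acc_vec = [0.0] * dim     # weighted sum of frames for the current token

    for frame, w in zip(frames, weights):
        if acc_weight + w < threshold:
            # Not enough evidence yet: keep integrating.
            acc_weight += w
            acc_vec = [a + w * x for a, x in zip(acc_vec, frame)]
        else:
            # Fire: use only the part of w needed to reach the threshold.
            used = threshold - acc_weight
            tokens.append([a + used * x for a, x in zip(acc_vec, frame)])
            # The leftover weight starts the next token.
            rest = w - used
            acc_weight = rest
            acc_vec = [rest * x for x in frame]

    return tokens


if __name__ == "__main__":
    frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]
    weights = [0.5, 0.5, 0.5, 0.5]
    # Four frames with total weight 2.0 collapse into two token vectors.
    print(integrate_and_fire(frames, weights))  # [[0.5, 0.5], [1.5, 1.5]]
```

Under these assumptions the number of emitted vectors follows the sum of the frame weights, so the aggregation step itself has no trainable parameters. In a fused model of the kind described, the resulting token-level vectors would then pass through the fully connected layer before reaching the linguistic encoder.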

Journal: IEEE Signal Processing Letters
Publication Date: Jan 1, 2021
Citations: 52

Similar Papers

Recognition of target domain Japanese speech using language model replacement
Daiki Mori ... Norihide Kitaoka
EURASIP Journal on Audio, Speech, and Music Processing | VOL. 2024
20 Jul 2024

A closer look at reinforcement learning-based automatic speech recognition
Fan Yang ... Rita Singh
Computer Speech & Language | VOL. 87
16 Mar 2024

The use of discrete distributions with a very large codebook for automatic speech recognition and speaker verification
Guoli Ye
23 Dec 2014

Investigation of a Single-Channel Frequency-Domain Speech Enhancement Network to Improve End-to-End Bengali Automatic Speech Recognition Under Unseen Noisy Conditions
Md Mahbub E Noor ... Hsin-Min Wang
18 Nov 2021
