Adapting Pre-Trained Self-Supervised Learning Model for Speech Recognition with Light-Weight Adapters

Xianghu Yue,Xinyuan Qian,Xiaoxue Gao,Haizhou Li

doi:10.3390/electronics13010190

Abstract

Self-supervised learning (SSL) is an effective way of learning rich and transferable speech representations from unlabeled data to benefit downstream tasks. However, effectively incorporating a pre-trained SSL model into an automatic speech recognition (ASR) system remains challenging. In this paper, we propose a network architecture with light-weight adapters to adapt a pre-trained SSL model for an end-to-end (E2E) ASR. An adapter is introduced in each SSL network layer and trained on the downstream ASR task, while the parameters of the pre-trained SSL network layers remain unchanged. By carrying over all pre-trained parameters, we avoid the catastrophic forgetting problem. At the same time, we allow the network to quickly adapt to ASR task with light-weight adapters. The experiments using LibriSpeech and Wall Street Journal (WSJ) datasets show that (1) the proposed adapter-based fine-tuning consistently outperforms full-fledged training in low-resource scenarios, with up to 17.5%/12.2% relative word error rate (WER) reduction on the 10 min LibriSpeech split; (2) the adapter-based adaptation also shows competitive performance in high-resource scenarios, which further validates the effectiveness of the adapters.

Full Text