In this paper, a novel cross-lingual adaptation framework called CAM is presented for low-resource language speech recognition (LLSR). It builds on the recently popular adapter method. CAM adapts self-supervised speech models (SSMs) from source languages to target low-resource languages in a two-stage process. CAM addresses two gaps in current methods: (i) language similarity is not effectively considered; (ii) the performance-efficiency trade-off is not well balanced. Specifically, two key components are designed: a similarity-aware fusion module (SAFM) and an adapter weight-sharing strategy (AWSS). SAFM introduces a well-trained adapter to compute precise language similarities via dot product. AWSS balances performance and efficiency by sharing adapter weights. Experimental results on two corpora, FLEURS and Common Voice, demonstrate that CAM equipped with these two designs, denoted as performance-oriented CAM (P-CAM), achieves state-of-the-art (SOTA) performance with satisfactory efficiency compared to current leading methods. In addition, an efficiency-oriented CAM (E-CAM) is presented, which introduces a weight-space fusion module (WSFM). The core of WSFM is to average the weights of multiple adapters into a new adapter. Compared to full fine-tuning, E-CAM requires only 5.0% of the trainable parameters while achieving a 2.3% relative average word error rate (WER) reduction. CAM thus offers a performance- or efficiency-oriented trade-off to meet the different needs of speech recognition systems.
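The two core operations named in the abstract can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the function names and data layout are assumptions, and real adapters would be tensors inside an SSM rather than plain Python lists.

```python
# Hypothetical sketch (names are illustrative, not the paper's API).

def dot_similarity(target_vec, source_vecs):
    """SAFM-style similarity: dot product between a target-language
    representation and each source-language representation."""
    return [sum(t * s for t, s in zip(target_vec, sv)) for sv in source_vecs]

def average_adapters(adapters):
    """WSFM-style weight-space fusion: element-wise mean of several
    adapters' weights, producing one new adapter."""
    return {
        name: [sum(vals) / len(adapters)
               for vals in zip(*(a[name] for a in adapters))]
        for name in adapters[0]
    }
```

For example, fusing two adapters whose shared weight vector is `[1.0, 2.0]` and `[3.0, 4.0]` yields `[2.0, 3.0]`; the dot-product scores then let the framework weight source languages by similarity to the target.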