Abstract

Multilingual automatic speech recognition (ASR) models with phones as modeling units have improved greatly in low-resource and similar-language scenarios, benefiting from shared representations across languages. Meanwhile, subwords have demonstrated their effectiveness for monolingual end-to-end recognition systems. In this paper, we investigate the use of phone-based subwords, specifically Byte Pair Encoding (BPE), as modeling units for multilingual end-to-end speech recognition. To explore the possibilities of phone-based BPE (PBPE) for multilingual ASR, we first apply three types of multilingual BPE training methods to similar low-resource languages in Central Asia. Then, by adding three high-resource European languages to the experiments, we analyze the degree of language sharing in similar-language and low-resource scenarios. Finally, we propose a method to adjust the bigram statistics in the BPE algorithm and show that the PBPE representation leads to accuracy improvements in multilingual scenarios. The experiments show that PBPE outperforms phone, character, and character-based BPE output representation units. In particular, the best PBPE model in the multilingual experiments achieves a 25% relative improvement on a low-resource language compared to a character-based BPE system.
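To make the BPE procedure referenced above concrete, the following is a minimal sketch of standard BPE merge learning applied to phone sequences. It is a generic illustration of the algorithm, not the paper's implementation; the function name, the `+` join convention for merged phones, and the toy phone corpus are all assumptions for illustration. The proposed bigram-statistics adjustment would plug in where the pair counts are computed.

```python
from collections import Counter

def learn_bpe_merges(corpus, num_merges):
    """Learn BPE merge rules over phone sequences (illustrative sketch).

    corpus: list of phone sequences, one per word (e.g. [["s", "t", "a", "r"], ...]).
    Returns the ordered merge list and the re-segmented corpus.
    """
    corpus = [list(seq) for seq in corpus]
    merges = []
    for _ in range(num_merges):
        # Count adjacent-pair (bigram) frequencies across the corpus.
        # The paper's proposed method adjusts these statistics; here we
        # use plain counts as in standard BPE.
        pairs = Counter()
        for seq in corpus:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent bigram
        merges.append(best)
        # Replace every occurrence of the best pair with a single merged unit,
        # joining the two phones with "+" (an arbitrary marker choice here).
        new_corpus = []
        for seq in corpus:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                    out.append(seq[i] + "+" + seq[i + 1])
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return merges, corpus

# Toy example: three words sharing the phone prefix "s t".
phone_corpus = [["s", "t", "a", "r"], ["s", "t", "a", "p"], ["s", "t", "e", "m"]]
merges, segmented = learn_bpe_merges(phone_corpus, 2)
```

On this toy corpus the first merge is the most frequent bigram `("s", "t")`, and the second merge extends it to `"s+t+a"`, so the first word segments as `["s+t+a", "r"]`. Character-based BPE works identically but over graphemes instead of phones, which is the contrast the experiments draw.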
