Abstract

Multilingual models for Automatic Speech Recognition (ASR) are attractive: they have been shown to benefit from more training data and lend themselves better to adaptation for under-resourced languages. However, initialisation from monolingual context-dependent models leads to an explosion of context-dependent states. Connectionist Temporal Classification (CTC) is a potential solution, as it performs well with monophone labels. We investigate multilingual CTC training in the context of adaptation and regularisation techniques that have been shown to be beneficial in more conventional settings. The multilingual model is trained with the CTC loss function to model a universal phone set based on the International Phonetic Alphabet (IPA). Learning Hidden Unit Contributions (LHUC) is investigated as a language adaptive training method. For cross-lingual adaptation, we introduce and investigate extending the multilingual output layer to new phonemes. In addition, dropout during multilingual training and cross-lingual adaptation is studied as a way to mitigate overfitting. Experiments show that the performance of the universal phoneme-based CTC system can be improved by applying dropout and LHUC, and that the system is extensible to new phonemes during cross-lingual adaptation. Updating all acoustic model parameters shows consistent improvements on limited data. Applying dropout during adaptation further improves the system, achieving performance competitive with Deep Neural Network / Hidden Markov Model (DNN/HMM) systems on limited data.

Highlights

  • Automatic speech recognition (ASR) systems have improved dramatically in recent years

  • It has been shown that recognition accuracy can reach human parity on certain tasks (Xiong et al., 2017), but building Automatic Speech Recognition (ASR) systems with good performance requires a lot of training data

  • The contribution of this paper is threefold: First, we demonstrate that Learning Hidden Unit Contributions (LHUC) is an effective language adaptive training approach for improving the multilingual Connectionist Temporal Classification (CTC) model

Summary

Introduction

Automatic speech recognition (ASR) systems have improved dramatically in recent years. It has been shown that recognition accuracy can reach human parity on certain tasks (Xiong et al., 2017), but building ASR systems with good performance requires a lot of training data. There is increased interest in rapidly developing high-performance ASR systems with limited data. A common solution is to exploit universal phonetic structure across languages by sharing the hidden layers of deep neural networks (DNNs). The target of the multilingual DNN can be either universal International Phonetic Alphabet (IPA)-based multilingual senones (e.g., Dupont et al., 2005; Lin et al., 2009; Vu et al., 2014) or a layer consisting of separate activations for each language (e.g., Scanzio et al., 2008; Huang et al., 2013; Ghoshal et al., 2013; Heigold et al., 2013).
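The LHUC-based language adaptive training mentioned above can be sketched very simply: the hidden layers are shared across languages, and each language learns a per-unit amplitude 2·sigmoid(r) that rescales the shared hidden activations. The following is a minimal illustrative sketch in plain Python (the function names and toy values are ours, not from the paper):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lhuc_scale(hidden, r):
    """Rescale shared hidden-layer activations with per-unit LHUC
    amplitudes 2*sigmoid(r_i), where r holds one learnable scalar
    per hidden unit for the current language."""
    return [2.0 * sigmoid(ri) * hi for hi, ri in zip(hidden, r)]

# With r = 0 the amplitude is exactly 1.0, so the shared (unadapted)
# network is recovered; during adaptive training only r is language-specific.
h = [0.5, -1.2, 2.0]
print(lhuc_scale(h, [0.0, 0.0, 0.0]))  # → [0.5, -1.2, 2.0]
```

In practice the amplitudes are trained jointly with (or after) the shared parameters, and the bounded range (0, 2) keeps the adapted activations close to the multilingual ones, which limits overfitting on small adaptation sets.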
