Abstract

It is challenging to obtain large amounts of native (matched) labels for speech audio in underresourced languages. This challenge often stems from a lack of literate speakers of the language or, in extreme cases, the lack of a universally acknowledged orthography. One solution is to increase the amount of labeled data by using mismatched transcription, which employs transcribers who do not speak the underresourced language of interest (called the target language) to transcribe, in place of native speakers, what they hear as nonsense speech in their own annotation language ( $\ne$ target language). Previous uses of mismatched transcription converted it to a probabilistic transcription (PT), but PT is limited by the errors of nonnative perception. This paper proposes, instead, a multitask learning framework in which one deep neural network (DNN) is trained to optimize two separate tasks: acoustic modeling of a small amount of matched transcription labeled with target-language graphemes, and acoustic modeling of a large amount of mismatched transcription labeled with annotation-language graphemes. We find that: first, the multitask learning framework gives significant improvement over monolingual, semisupervised learning, multilingual DNN training, and transfer learning baselines; second, a Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) adapted using PT improves alignments, thereby improving training; and third, bottleneck features trained on the mismatched transcriptions lead to even better alignments, resulting in further performance gains of the multitask DNN. Our experiments are conducted on the IARPA Georgian and Vietnamese BABEL corpora as well as on our newly collected speech corpus of Singapore Hokkien, an underresourced language with no standard written form.
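The multitask architecture described above, a single DNN with a shared trunk and two task-specific output layers (one over target-language graphemes for matched transcriptions, one over annotation-language graphemes for mismatched transcriptions), can be sketched as follows. This is a minimal illustrative forward pass only; the layer sizes, grapheme-inventory sizes, and feature dimensionality are assumptions for the sketch, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    """Row-wise softmax with max-shift for numerical stability."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class MultitaskDNN:
    """Shared hidden trunk with two task-specific softmax heads:
    one over target-language graphemes (matched transcriptions),
    one over annotation-language graphemes (mismatched transcriptions).
    Gradients from both tasks would update the shared weights during training."""

    def __init__(self, n_feats, n_hidden, n_target_graphemes, n_annot_graphemes):
        self.W_shared = rng.standard_normal((n_feats, n_hidden)) * 0.1
        self.W_target = rng.standard_normal((n_hidden, n_target_graphemes)) * 0.1
        self.W_annot = rng.standard_normal((n_hidden, n_annot_graphemes)) * 0.1

    def forward(self, x, task):
        h = np.tanh(x @ self.W_shared)      # shared acoustic representation
        if task == "matched":               # target-language grapheme posteriors
            return softmax(h @ self.W_target)
        return softmax(h @ self.W_annot)    # annotation-language grapheme posteriors

# Toy batch: 4 frames of 40-dimensional acoustic features (illustrative sizes).
net = MultitaskDNN(n_feats=40, n_hidden=64, n_target_graphemes=30, n_annot_graphemes=26)
x = rng.standard_normal((4, 40))
p_matched = net.forward(x, "matched")
p_mismatched = net.forward(x, "mismatched")
```

Because both heads read from the same hidden layer, the plentiful mismatched transcriptions regularize the shared representation that the scarce matched transcriptions alone could not train well.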
