Abstract

We present our study on semi-supervised Gaussian mixture model (GMM) hidden Markov model (HMM) and deep neural network (DNN) HMM acoustic model training. We analyze the impact of transcription quality and data sampling approaches on the performance of the resulting model, and propose a multi-system combination and confidence re-calibration approach to improve transcription inference and data selection. Compared to using a single system's recognition result and confidence score, our proposed approach reduces the phone error rate of the inferred transcriptions by a relative 23.8% when the top 60% of the data is selected. Experiments were conducted on the mobile short message dictation (SMD) task. For the GMM-HMM model, we achieved a 7.2% relative word error rate reduction (WERR) against a well-trained narrow-band fMPE+bMMI system by adding 2,100 hours of untranscribed data, and a 28.2% relative WERR over a wide-band MLE model trained on transcribed out-of-domain voice search data after adding 10K hours of untranscribed SMD data. For the CD-DNN-HMM model, 11.7% and 15.0% relative WERRs are achieved after adding 1K hours of untranscribed data using random and importance sampling, respectively. We also found that using a large amount of untranscribed data for pretraining does not help.

Index Terms: semi-supervised acoustic model training, system combination, confidence re-calibration, importance sampling
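The data-selection step described above (keeping the top 60% of utterances ranked by re-calibrated confidence) can be sketched as follows. This is a minimal illustration, not the paper's exact method: the logistic re-calibration, its parameters, and all names here are hypothetical stand-ins.

```python
import math

def recalibrate(raw_conf, a=4.0, b=-2.0):
    """Map a raw recognizer confidence to [0, 1] via a logistic
    transform -- a simple stand-in for a learned re-calibration."""
    return 1.0 / (1.0 + math.exp(-(a * raw_conf + b)))

def select_top_fraction(utterances, fraction=0.6):
    """Keep the top `fraction` of utterances by re-calibrated
    confidence; the selected subset is used for retraining."""
    ranked = sorted(utterances,
                    key=lambda u: recalibrate(u["conf"]),
                    reverse=True)
    k = int(len(ranked) * fraction)
    return ranked[:k]

# Toy pool of automatically transcribed utterances with raw confidences.
utts = [
    {"id": "u1", "conf": 0.9},
    {"id": "u2", "conf": 0.2},
    {"id": "u3", "conf": 0.7},
    {"id": "u4", "conf": 0.5},
    {"id": "u5", "conf": 0.8},
]
selected = select_top_fraction(utts, fraction=0.6)
print([u["id"] for u in selected])  # -> ['u1', 'u5', 'u3']
```

In practice the ranking score would come from the multi-system combination rather than a single recognizer, but the selection mechanics are the same.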
