Abstract
In this paper, we propose acoustic-feature-based submodular function optimization to select a subset of untranscribed data for manual transcription; the initial acoustic model is then retrained with the additional transcribed data. The acoustic features are obtained from an unsupervised Gaussian mixture model. We also integrate the acoustic features with phonetic features, which are obtained from an initial ASR system, in the submodular function. Submodular function optimization has a theoretically proven near-optimal guarantee. We performed experiments on 1000 hours of Mandarin mobile phone speech, of which 300 hours served as initial data for training an initial acoustic model. The experimental results show that the acoustic-feature-based approach, which does not rely on an initial ASR system, performs as well as the phonetic-feature-based approach. Moreover, acoustic-feature-based and phonetic-feature-based data selection are complementary. The submodular function with the combined features provides a 4.8% relative character error rate (CER) reduction over the corresponding ASR system using random selection. We also include the desired feature distribution obtained from a development set in a generalized function, but the improvement is insignificant.
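To make the selection idea concrete, the following is a minimal sketch of greedy submodular data selection. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not give the exact objective, so the sketch assumes a feature-based function f(S) = sum over feature dimensions of the square root of the accumulated feature mass in S, with non-negative per-utterance feature vectors (for example, counts of Gaussian components or phones). The names `feature_based_value` and `greedy_select`, and the fixed-size budget `k`, are hypothetical. For a monotone submodular objective under such a size constraint, greedy selection is known to reach at least a (1 - 1/e) fraction of the optimal value.

```python
import math


def feature_based_value(selected, features):
    """Assumed objective: sum over feature dims of sqrt(total mass in the selection).

    The square root is concave, so repeated coverage of the same feature
    yields diminishing returns, which makes the function submodular.
    """
    if not selected:
        return 0.0
    dims = len(features[0])
    return sum(
        math.sqrt(sum(features[i][d] for i in selected)) for d in range(dims)
    )


def greedy_select(features, k):
    """Greedily pick up to k utterance indices with the largest marginal gain."""
    selected = []
    remaining = set(range(len(features)))
    current = 0.0
    while remaining and len(selected) < k:
        best, best_gain = None, 0.0
        # Iterate in index order so ties are broken deterministically.
        for i in sorted(remaining):
            gain = feature_based_value(selected + [i], features) - current
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:  # no candidate adds value
            break
        selected.append(best)
        remaining.remove(best)
        current += best_gain
    return selected


# Toy data (made up for illustration): utterances 0 and 1 cover the same
# feature heavily; utterance 2 covers a different feature lightly.
toy_features = [[4.0, 0.0], [4.0, 0.0], [0.0, 1.0]]
chosen = greedy_select(toy_features, k=2)
```

On the toy data, the second pick favors the utterance that covers a new feature over a near-duplicate of the first pick, which is the diversity effect that motivates submodular selection over random selection.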