Abstract

This paper presents our work in building an Indonesian speech recognizer to handle both spontaneous and dictated speech. The recognizer is based on the Gaussian Mixture and Hidden Markov Models (GMM-HMM). The model is first trained on 73hours of dictated speech and 43.5 minutes of spontaneous speech. The dictated speech is read from prepared transcripts by a diverse group of 244 Indonesian speakers. The spontaneous speech is manually labelled from recordings of an Indonesian parliamentary meeting, and is interspersed with noises and fillers. The resulting triphone model is then adapted only to the spontaneous speech using the Maximum A-posteriori Probability (MAP) method. We evaluate the adapted model using separate dictated and spontaneous evaluation sets. The dictated set consists of speech from 20 speakers totaling 14.5 hours. The spontaneous set is derived from the recording of a regional government meeting, consisting of 1085 utterances totaling 48.5minutes. Evaluation of a MAP-adapted spontaneous set yields a 2.60% absolute increase in Word Accuracy Rate (WAR) over the un-adapted model, outperforming MMI adaptation. Conversely, MMI adaption of the dictated set outperforms the MAP adaptation by achieving an absolute increase of 1.48% in WAR over the un-adapted model. We also demonstrate that fMLLR speaker adaptation is unsuitable for our task due to limited adaptation data.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.