Abstract
Many acoustic misrecognitions in our 86 000-word speaker-trained isolated-word recognizer are due to phonemic hidden Markov models (phoneme models) mapping to short segments of speech. When we force these models to map to longer segments corresponding to the observed minimum durations for the phonemes, the likelihood of the incorrect phoneme sequences drops dramatically. This drop in the likelihood of the incorrect words results in a significant reduction in the acoustic recognition error rate (we use the term acoustic recognition error rate to mean the recognition error rate when every word in the vocabulary is considered a priori equally likely). Even in cases where acoustic recognition performance is unchanged, the likelihood of the correct word choice improves relative to the incorrect word choices, resulting in a significant reduction in the recognition error rate with the language model. On nine speakers, the acoustic recognition error rate falls from 18·6 to 17·3%, while the error rate with the language model falls from 9·2 to 7·2%. We have also improved the phoneme models by correcting the segmentation of the phonemes in the training set. During training, the boundaries between phonemes are not marked accurately; we use energy to correct these boundaries. Applying an energy threshold improves the segment boundaries between stops and sonorants (vowels, liquids and glides), between fricatives and sonorants, between affricates and sonorants, and between breath noise and sonorants. Training the phoneme models on these resegmented phonemes yields models that increase recognition accuracy significantly. On two speakers, the acoustic recognition error rate falls from 26·5 to 23·1%, while the error rate with the language model falls from 11·3 to 8·8%. This reduction is in addition to the reductions obtained by imposing minimum duration constraints. The overall reduction in errors for these two speakers, using both minimum durations and energy thresholds, is from 27·3 to 23·1% for acoustic recognition, and from 14·3 to 8·8% with the language model.
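The abstract does not describe how the minimum duration constraints are imposed. A common way to enforce a minimum duration of d frames in an HMM is to expand the phoneme model into a chain of d tied states (all sharing the phoneme's emission distribution) in which only the last state has a self-loop, so every path through the model must emit at least d frames. The sketch below illustrates that construction under this assumption; the function name, the single-state starting model, and the non-emitting exit state are illustrative choices, not the authors' implementation.

```python
import numpy as np

def expand_min_duration(self_loop_prob, min_dur):
    """Build a left-to-right transition matrix that forces a phoneme
    to occupy at least `min_dur` frames.

    States 0..min_dur-2 must advance (no self-loop); only the final
    emitting state may dwell, so every path through the model emits
    at least `min_dur` frames. All emitting states are assumed to
    share the original phoneme's emission distribution (tied states),
    so only the topology changes. The last row/column is a
    non-emitting exit state.
    """
    n = min_dur
    A = np.zeros((n + 1, n + 1))
    for i in range(n - 1):
        A[i, i + 1] = 1.0                   # forced advance: no self-loop
    A[n - 1, n - 1] = self_loop_prob        # only the last state may dwell
    A[n - 1, n] = 1.0 - self_loop_prob      # exit after >= min_dur frames
    A[n, n] = 1.0                           # absorbing exit state
    return A

# Example: a phoneme whose observed minimum duration is 4 frames,
# starting from a single-state model with self-loop probability 0.7.
A = expand_min_duration(self_loop_prob=0.7, min_dur=4)
```

Because the added states are tied, this raises the cost of decoding only modestly while sending the likelihood of any alignment shorter than the minimum duration to zero, which is consistent with the sharp likelihood drop for incorrect phoneme sequences reported above.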
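The abstract likewise reports only that an energy threshold is used to correct consonant-sonorant boundaries in the training segmentation. One plausible realization is to search a small window around each boundary from the initial forced alignment and move the boundary to the first frame whose energy crosses the threshold, i.e. where the low-energy stop, fricative, affricate, or breath noise ends and the high-energy sonorant begins. The sketch below is a minimal version of that idea; the function name, the window size, and the exact placement rule are all assumptions.

```python
import numpy as np

def correct_boundary(frame_energy, init_boundary, threshold, search=5):
    """Nudge a phoneme boundary using a simple energy threshold.

    frame_energy:  per-frame energy (e.g. log energy) of the utterance.
    init_boundary: boundary frame index from the initial alignment.
    threshold:     energy level separating the consonant from the
                   adjacent sonorant (assumed given or tuned per speaker).
    search:        how many frames on each side of the initial
                   boundary to examine.

    Returns the first frame in the window whose energy rises above
    the threshold; if none does, the original boundary is kept.
    """
    lo = max(0, init_boundary - search)
    hi = min(len(frame_energy), init_boundary + search + 1)
    for t in range(lo, hi):
        if frame_energy[t] >= threshold:
            return t
    return init_boundary  # no crossing found; keep original boundary

# Example: energy stays low through the stop closure, then jumps at
# the vowel onset; the boundary is moved from frame 3 to frame 4.
energy = np.array([0.1, 0.1, 0.2, 0.2, 3.5, 3.8, 4.0])
print(correct_boundary(energy, init_boundary=3, threshold=1.0))  # -> 4
```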