Abstract

Many acoustic misrecognitions in our 86 000-word speaker-trained isolated-word recognizer are due to phonemic hidden Markov models (phoneme models) mapping to short segments of speech. When we force these models to map to longer segments corresponding to the observed minimum durations for the phonemes, the likelihood of the incorrect phoneme sequences drops dramatically. This drop in the likelihood of the incorrect words results in a significant reduction in the acoustic recognition error rate, by which we mean the recognition error rate when every word in the vocabulary is considered a priori equally likely. Even in cases where acoustic recognition performance is unchanged, the likelihood of the correct word choice improves relative to the incorrect word choices, resulting in a significant reduction in the recognition error rate with the language model. On nine speakers, the error rate for acoustic recognition falls from 18·6 to 17·3%, while the error rate with the language model falls from 9·2 to 7·2%.

We have also improved the phoneme models by correcting the segmentation of the phonemes in the training set. During training, the boundaries between phonemes are not marked accurately. We use energy to correct these boundaries. Application of an energy threshold improves the segment boundaries between stops and sonorants (vowels, liquids and glides), between fricatives and sonorants, between affricates and sonorants, and between breath noise and sonorants. Training the phoneme models with these segmented phonemes results in models that increase recognition accuracy significantly. On two speakers, the error rate for acoustic recognition falls from 26·5 to 23·1%, while the error rate with the language model falls from 11·3 to 8·8%. This reduction in error rate is in addition to the error rate reductions obtained by imposing minimum duration constraints.
The overall reduction in errors for these two speakers using minimum durations and energy thresholds is from 27·3 to 23·1% for acoustic recognition, and from 14·3 to 8·8% with the language model.
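
The effect of a minimum duration constraint on an incorrect phoneme sequence can be illustrated with a small sketch. It is not the recognizer described above: the abstract does not say how the constraints are imposed, so this example assumes a simplified alignment in which each phoneme contributes one log-likelihood per frame and transition probabilities are ignored. The function name `align_score`, the toy scores and the three-phoneme sequence A, X, B are all hypothetical.

```python
import math


def align_score(frame_loglik, min_durations):
    """Best total log-likelihood of aligning a phoneme sequence to all frames.

    frame_loglik[t][j] is the log-likelihood of frame t under phoneme j.
    min_durations[j] is the minimum number of frames phoneme j must cover.
    Returns -inf if the frames cannot accommodate the minimum durations.
    """
    n_frames = len(frame_loglik)
    n_phones = len(min_durations)

    # prefix[j][t] = sum of frame_loglik[0..t-1][j], for fast segment sums
    prefix = [[0.0] * (n_frames + 1) for _ in range(n_phones)]
    for j in range(n_phones):
        for t in range(n_frames):
            prefix[j][t + 1] = prefix[j][t] + frame_loglik[t][j]

    # best[t] = best score of the phonemes handled so far, ending at frame t
    best = [-math.inf] * (n_frames + 1)
    best[0] = 0.0
    for j in range(n_phones):
        new_best = [-math.inf] * (n_frames + 1)
        for end in range(1, n_frames + 1):
            # the segment [start, end) must span at least min_durations[j] frames
            for start in range(0, end - min_durations[j] + 1):
                if best[start] == -math.inf:
                    continue
                score = best[start] + prefix[j][end] - prefix[j][start]
                if score > new_best[end]:
                    new_best[end] = score
        best = new_best
    return best[n_frames]


# Toy data (assumed): six frames scored against the phonemes A, X, B of an
# incorrect word. A fits the first three frames, B the last three, and the
# spurious phoneme X fits nowhere.
example = [[-1, -4, -5]] * 3 + [[-5, -4, -1]] * 3

unconstrained = align_score(example, [1, 1, 1])  # X squeezed into one frame
constrained = align_score(example, [1, 3, 1])    # X forced to cover three frames
print(unconstrained, constrained)  # prints: -9.0 -15.0
```

Without the constraint, the poorly matching phoneme X is squeezed into a single frame and costs the incorrect word very little; once X must cover three frames, the score of the incorrect sequence drops, which is the behaviour the abstract attributes to minimum durations.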
