Abstract
While hidden Markov models (HMMs) serve as the basic acoustic modeling framework for many automatic speech recognition systems, they are known to model the duration of sound units poorly. Phone duration normalization can be accomplished by reconstructing and inserting missing frames when a phone is shorter than the desired duration, and by deleting frames when a phone is longer than the desired duration. If phone segmentations are known a priori, this technique achieves relative reductions in word error rate (WER) of up to 35%, confirming the conjecture that speech with normalized phone durations may be modeled better and discriminated more accurately using standard HMM acoustic models. Unfortunately, duration normalization using imperfect, automatically generated phone segmentations has not yielded significant recognition improvements. To address this, a modification of the duration normalization approach has been developed: three different feature streams are generated for each utterance using various combinations of expansion and contraction of the hypothesized phone segments, and each stream is recognized using an acoustic model trained for that stream. While the resulting recognition hypotheses are not themselves significantly better than the baseline, they can be combined automatically to produce relative improvements in WER of up to 7.7% across several speech databases. [Work supported by DARPA and Telefónica.]
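To make the frame insertion and deletion concrete, the sketch below (Python with NumPy) normalizes each hypothesized phone segment to a fixed number of frames. The abstract does not specify how missing frames are reconstructed, so linear interpolation along the time axis stands in for the reconstruction step here, and excess frames are removed by uniform resampling of the original time axis; the function names, the (num_frames, num_features) feature layout, and the target length of 10 frames are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of per-phone duration normalization, assuming each phone
# segment is a (num_frames, num_features) array of acoustic features (e.g.
# MFCCs). Reconstruction method is assumed: linear interpolation for short
# phones, uniform resampling (frame dropping) for long phones.
import numpy as np

def normalize_phone_duration(frames: np.ndarray, target_len: int) -> np.ndarray:
    """Stretch or compress a phone segment to exactly target_len frames.

    Shorter phones gain interpolated (reconstructed) frames; longer phones
    lose frames via uniform resampling of the original time axis.
    """
    src_len = frames.shape[0]
    if src_len == target_len:
        return frames.copy()
    # Map each output frame to a (possibly fractional) source position.
    positions = np.linspace(0.0, src_len - 1, num=target_len)
    lo = np.floor(positions).astype(int)
    hi = np.minimum(lo + 1, src_len - 1)
    frac = (positions - lo)[:, None]
    # Linearly interpolate between the two neighboring source frames.
    return (1.0 - frac) * frames[lo] + frac * frames[hi]

def normalize_utterance(features: np.ndarray, segments, target_len: int = 10):
    """Apply duration normalization to every hypothesized phone segment.

    `segments` is a list of (start, end) frame indices, one pair per phone,
    as produced by a forced alignment or a first recognition pass.
    """
    pieces = [normalize_phone_duration(features[s:e], target_len)
              for s, e in segments]
    return np.concatenate(pieces, axis=0)
```

In the multi-stream variant described above, one would produce three such streams per utterance (for example, expansion only, contraction only, and both), decode each with an acoustic model trained on identically processed data, and combine the resulting hypotheses automatically, e.g., with a voting scheme such as ROVER.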