Abstract
Although researchers have observed that automatic speech recognition errors are often associated with improbable duration patterns, current recognition systems take little advantage of known durational cues. In addition to using simplistic models of duration, recognizers put very little weight on durational probabilities (because of the high dimensionality of the feature vectors typically used). This work explores the benefits that can be gained in speaker‐independent continuous speech recognition from more accurate duration modeling and from automatic estimation of a duration score weight. In the context of the stochastic segment model, segment durations are modeled using gamma distributions conditioned on phonetic, lexical, and phrasal context, as well as on speaking rate. Robust parameter estimation, a potential problem given the large number of conditioning factors, is achieved using automatic clustering techniques with a maximum likelihood criterion. A by‐product of the clustering procedure is a better understanding of the relative importance of the conditioning factors. Speech recognition experiments on the Resource Management task show that simple context‐independent gamma distributions give results similar to those of nonparametric relative frequency models, and that error can be reduced by more than 10% by conditioning models on such factors as prepausal position, speaking rate, and lexical stress.
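As an illustration of the gamma duration model and the duration score weight described above, the following is a minimal sketch and not the authors' implementation. It assumes durations are measured in frames, fits the gamma parameters by the method of moments (swapped in here for simplicity; the abstract does not state the estimator), and combines a segment's acoustic log score with a weighted duration log probability. The function names and the `duration_weight` parameter are hypothetical.

```python
import math
from statistics import mean, pvariance


def fit_gamma_moments(durations):
    """Method-of-moments gamma fit (an assumption, chosen for simplicity).

    Returns (shape, scale) such that shape * scale equals the sample mean
    and shape * scale**2 equals the sample variance.
    """
    m = mean(durations)
    v = pvariance(durations, m)
    if m <= 0 or v <= 0:
        raise ValueError("durations must be positive and not all identical")
    return m * m / v, v / m


def gamma_log_pdf(duration, shape, scale):
    """Log density of a gamma distribution evaluated at one duration."""
    if duration <= 0:
        return float("-inf")
    return (
        (shape - 1.0) * math.log(duration)
        - duration / scale
        - shape * math.log(scale)
        - math.lgamma(shape)
    )


def combined_score(acoustic_log_score, duration, shape, scale, duration_weight):
    """Acoustic log score plus a weighted duration log probability.

    duration_weight is a hypothetical tunable constant standing in for the
    automatically estimated duration score weight mentioned in the abstract.
    """
    return acoustic_log_score + duration_weight * gamma_log_pdf(duration, shape, scale)
```

In a context-dependent setting, one such (shape, scale) pair would be estimated per cluster of conditioning factors (for example, prepausal versus non-prepausal segments), and the pair matching a segment's context would be used when scoring it.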