Use of explicit duration models in speech recognition

Cynthia Fong, Mari Ostendorf

doi:10.1121/1.407913

Abstract

Although researchers have observed that automatic speech recognition errors are often associated with improbable duration patterns, current recognition systems take little advantage of known durational cues. In addition to using simplistic models of duration, recognizers put very little weight on durational probabilities (because of the high dimensionality of the feature vectors typically used). This work explores the benefits that can be gained in speaker‐independent continuous speech recognition from more accurate duration modeling and from automatic estimation of a duration score weight. In the context of the stochastic segment model, segment durations are modeled using gamma distributions conditioned on phonetic, lexical, and phrasal context, as well as speaking rate. Robust parameter estimation, a potential problem given the large number of different conditioning factors, is achieved using automatic clustering techniques with a maximum likelihood criterion. A by‐product of the clustering procedure is a better understanding of the relative importance of the conditioning factors. Speech recognition experiments on the Resource Management task show that simple context‐independent gamma distributions give similar results to nonparametric relative frequency models, and that error can be reduced by more than 10% by conditioning models on such factors as prepausal position, speaking rate, and lexical stress.
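To make the idea of a weighted gamma duration score concrete, here is a minimal sketch in Python. It is not the authors' implementation. It fits a single context‐independent gamma distribution by the method of moments (a simplification swapped in for the maximum likelihood estimation and clustering described in the abstract) and adds a weighted log duration probability to an acoustic log score. The function names, the example durations, and the weight values are all hypothetical.

```python
import math

def fit_gamma_moments(durations):
    """Estimate gamma shape and scale by matching the sample mean and variance.

    Assumption: method of moments is used here for brevity; it is not the
    maximum likelihood estimator mentioned in the abstract.
    """
    n = len(durations)
    mean = sum(durations) / n
    var = sum((d - mean) ** 2 for d in durations) / n
    shape = mean * mean / var
    scale = var / mean
    return shape, scale

def gamma_log_pdf(x, shape, scale):
    """Log density of a gamma distribution with the given shape and scale."""
    return ((shape - 1.0) * math.log(x) - x / scale
            - shape * math.log(scale) - math.lgamma(shape))

def combined_score(acoustic_log_prob, duration, shape, scale, weight):
    """Add a weighted duration log-probability to an acoustic log score.

    `weight` plays the role of a duration score weight; how it would be
    estimated automatically is not shown here.
    """
    return acoustic_log_prob + weight * gamma_log_pdf(duration, shape, scale)

# Hypothetical segment durations (in frames) for one phone.
durations = [4, 6, 8, 10, 12]
shape, scale = fit_gamma_moments(durations)

# A typical duration should be penalized less than an improbable one.
typical = combined_score(-100.0, 8, shape, scale, weight=5.0)
improbable = combined_score(-100.0, 30, shape, scale, weight=5.0)
```

With a weight of zero the duration model has no effect on the score; larger weights make improbable durations count more heavily against a hypothesis. Conditioning on context would amount to keeping a separate (shape, scale) pair per context cluster.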
