Abstract

Between-word coarticulation is one of the major problems in continuous speech recognition because it modifies the acoustic characteristics at word boundaries. In large-vocabulary speech recognition the problem can be addressed by introducing interword units. If the vocabulary is small enough (e.g., digits), all possible coarticulations between all words of the vocabulary can be modeled. In this study every digit is represented by three segments: a core segment that can be assumed reasonably insensitive to any coarticulation effect, and head and tail segments that represent, respectively, the initial and the final part of every word spoken in isolation. In addition, a set of juncture segments is defined that represents the junction between every possible pair of words. The recognition process is driven by a regular grammar that represents all the allowed segment sequences, namely, digits spoken in isolation as well as sequences of digits spoken continuously or with pauses between words. Every segment is represented by a mixture density HMM. Since the vocabulary is composed of 11 words (digits 0 to 9 and oh), the overall number of models is 154 (11 cores, 11 heads, 11 tails, and 121 junctures). A number of experiments were carried out using the TI connected digits database, which consists of a set of digit strings. Recognition was performed under different experimental conditions, i.e., varying the number of mixture components, the number of models per segment, and the structure of the connection between words. String error rates of 3.27% and 1.97% were obtained when considering the best and the two best hypothesized strings, respectively. These results compare favorably with those obtained with whole word modeling.
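
The segment inventory and the way a digit string maps onto segments can be illustrated with a minimal Python sketch. It is not the authors' implementation: the label scheme (`head:`, `core:`, `tail:`, `junc:`), the `pause` symbol, and the functions `build_inventory` and `expand` are hypothetical names introduced here, and the handling of a pause (tail, pause, head in place of a juncture) is an assumption based on the description of heads and tails as the edges of words spoken in isolation.

```python
from itertools import product

# Vocabulary from the abstract: digits 0 to 9 plus "oh" (11 words).
WORDS = ["zero", "one", "two", "three", "four", "five",
         "six", "seven", "eight", "nine", "oh"]


def build_inventory(words):
    """Return all segment model labels (the label scheme is hypothetical)."""
    cores = [f"core:{w}" for w in words]
    heads = [f"head:{w}" for w in words]
    tails = [f"tail:{w}" for w in words]
    # One juncture per ordered word pair, including a word followed by itself.
    junctures = [f"junc:{a}+{b}" for a, b in product(words, repeat=2)]
    return cores + heads + tails + junctures


def expand(string, pauses=()):
    """Expand a non-empty word string into a segment sequence.

    `pauses` holds the indices i at which a pause follows word i. At those
    boundaries tail/pause/head is used instead of a juncture (an assumption).
    """
    segments = [f"head:{string[0]}"]
    for i, word in enumerate(string):
        segments.append(f"core:{word}")
        if i == len(string) - 1:
            segments.append(f"tail:{word}")
        elif i in pauses:
            segments += [f"tail:{word}", "pause", f"head:{string[i + 1]}"]
        else:
            segments.append(f"junc:{word}+{string[i + 1]}")
    return segments


if __name__ == "__main__":
    print(len(build_inventory(WORDS)))       # 11 + 11 + 11 + 121 models
    print(expand(["one", "two"]))            # continuous: uses a juncture
    print(expand(["one", "two"], pauses=(0,)))  # pause after the first word
```

Under these assumptions, an isolated digit expands to head, core, tail, while each boundary in a continuously spoken string contributes exactly one juncture segment, which is why the juncture count grows as the square of the vocabulary size.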
