Abstract

A new phone recognizer has been implemented that extends the phonotactic decoding constraint to sequences of three phones. It is based on a structure similar to a second-order ergodic hidden Markov model (HMM). This kind of model assumes a direct correspondence between model states and phones, so constraints on possible state sequences are equivalent to phonotactic constraints. Very high coverage by both left and right context-dependent phone models has been achieved using two methods. The first assumes that some contexts have the same or a very similar effect on the phone in question, so they are merged into the same contextual class. The outcome is a set of 19 left context classes and 18 right context classes. The second assumes that the left context mostly influences the beginning of a phone, whereas the right context influences the end of the phone. Each phone (a state in an ergodic HMM) is represented by a sequence of three probability density functions (pdfs), which is similar to a three-state left-to-right HMM. We generate acoustic models such that the first pdf in the model is conditioned on the left context, the middle pdf is context independent (although it can also be context dependent), and the last pdf is conditioned on the right context. A large number of such quasi-triphonic acoustic models can be generated, providing good triphone coverage for a given task while making efficient use of the available training data. The current implementations of the recognizer have been applied to the DARPA Resource Management (RM) task, to demonstrate the feasibility of performing phone (not phoneme) recognition using an untranscribed database, and to the TIMIT database, for comparison with existing phone recognition systems. Since true phone sequences for the training utterances are not available for the RM database, they are estimated from text using a phone realization classification tree trained on the TIMIT database transcriptions.
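The quasi-triphonic construction described above can be illustrated with a minimal sketch. It only shows how three pdfs might be combined into one model and how many models a small pdf inventory can yield. The pdf naming scheme, the function names, and the assumption that every left and right class combination occurs for a phone are hypothetical and are not taken from the recognizer itself; only the class counts (19 and 18) come from the text.

```python
from itertools import product

N_LEFT_CLASSES = 19   # number of left context classes stated in the abstract
N_RIGHT_CLASSES = 18  # number of right context classes stated in the abstract


def quasi_triphone(phone, left_class, right_class):
    """Return the three pdf identifiers of one quasi-triphonic model.

    The identifier format is purely illustrative (an assumption).
    """
    return (
        f"{phone}.begin|L{left_class}",  # first pdf: conditioned on the left context class
        f"{phone}.middle",               # middle pdf: context independent in this sketch
        f"{phone}.end|R{right_class}",   # last pdf: conditioned on the right context class
    )


def enumerate_models(phone):
    """List every left/right class combination for one phone (assumed all valid)."""
    return [
        quasi_triphone(phone, left, right)
        for left, right in product(range(N_LEFT_CLASSES), range(N_RIGHT_CLASSES))
    ]


models = enumerate_models("ae")
distinct_pdfs = {pdf for model in models for pdf in model}
print(len(models), "models built from", len(distinct_pdfs), "pdfs")
```

Under these assumptions a single phone needs only 19 + 1 + 18 = 38 pdfs to express 19 × 18 = 342 context combinations, which is the sense in which sharing pdfs across contexts makes efficient use of limited training data.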
The estimates of the true phone sequences are used in training the models and in generating reference phone sequences for scoring. The best phone recognition match between the most likely path through the classification tree and the phone recognizer output for the DARPA February 89 test set was 80·5% accurate and 84·0% correct. The best result obtained using the same recognizer structure on the TIMIT database is 69·4% accurate and 74·8% correct, which is a significant improvement over the best published result when both are reduced to the same phone set.
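The abstract reports both "accurate" and "correct" figures without defining them. The sketch below assumes the convention commonly used in phone recognition scoring: with N reference phones and S substitutions, D deletions, and I insertions from a minimum-edit-distance alignment, percent correct is (N − S − D)/N and accuracy is (N − S − D − I)/N. Whether the original scoring used exactly this convention is an assumption; the function names and the toy phone strings are hypothetical.

```python
def align_counts(ref, hyp):
    """Count (substitutions, deletions, insertions) along a minimum-edit-distance alignment."""
    n, m = len(ref), len(hyp)
    # cell[i][j] = (total_errors, subs, dels, ins) for ref[:i] aligned with hyp[:j]
    cell = [[None] * (m + 1) for _ in range(n + 1)]
    cell[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        cell[i][0] = (i, 0, i, 0)          # all reference phones deleted
    for j in range(1, m + 1):
        cell[0][j] = (j, 0, 0, j)          # all hypothesis phones inserted
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t, s, d, k = cell[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (t, s, d, k)            # match, no error added
            else:
                diagonal = (t + 1, s + 1, d, k)    # substitution
            t, s, d, k = cell[i - 1][j]
            deletion = (t + 1, s, d + 1, k)
            t, s, d, k = cell[i][j - 1]
            insertion = (t + 1, s, d, k + 1)
            # Tuples compare on total errors first, so the cheapest path wins.
            cell[i][j] = min(diagonal, deletion, insertion)
    _, subs, dels, ins = cell[n][m]
    return subs, dels, ins


def percent_correct_and_accuracy(ref, hyp):
    """Return (percent correct, percent accuracy) under the assumed convention."""
    subs, dels, ins = align_counts(ref, hyp)
    n = len(ref)
    correct = 100.0 * (n - subs - dels) / n
    accuracy = 100.0 * (n - subs - dels - ins) / n
    return correct, accuracy


reference = ["sil", "k", "ae", "t", "sil"]
hypothesis = ["sil", "k", "eh", "t", "s", "sil"]
print(percent_correct_and_accuracy(reference, hypothesis))
```

Because insertions are penalized only in the accuracy figure, accuracy can never exceed percent correct under this convention, which is consistent with the ordering of the two numbers reported for each database.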
