Abstract

For a large-vocabulary speech-recognition system, such as Dragon Systems' 30,000-word DragonDictate recognizer, an efficient approach to training is to use “phonemes-in-context” (PICs): triphones supplemented by a code describing prepausal lengthening. Each PIC is in turn represented by a sequence of one to six “phonetic elements” (PELs). For each phoneme there may be thousands of different PICs, but no more than 63 PELs. Initially, all PICs and PELs are trained from a database of about 16,000 tokens recorded by a reference speaker. When the recognizer is used by a new speaker, each word that is recognized is immediately used to adapt the PELs in its Markov models. After about a thousand words have been recognized, most PELs have been adapted to the new speaker, so that even the models for words that have not yet been spoken are appropriate for the new speaker. The recognizer was tested with two texts that differed greatly in vocabulary and style. Three speakers dictated each text: the reference speaker, a new male speaker, and a new female speaker. After adaptation on 1,500 words, performance for all three speakers was better than the performance of the reference speaker on unadapted models. With an active vocabulary of 25,000 words, the fraction of words recognized correctly was 86%, with an additional 8% appearing on a “choice list” of eight words.
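The key idea in the abstract — that adapting shared PELs improves models even for words the new speaker has never said — can be sketched as follows. This is a hypothetical illustration, not the paper's actual method: the lexicon, PEL names, one-dimensional features, and the exponential-moving-average update rule are all invented for clarity.

```python
# Sketch of shared-unit adaptation: each word's model is a sequence of
# PELs, and PEL parameters are shared across all words. Adapting the
# PELs aligned to one recognized word therefore also updates every
# other word that contains those PELs.

# Illustrative lexicon: word -> PEL sequence (names are invented).
LEXICON = {
    "cat": ["k1", "ae2", "t1"],
    "cot": ["k1", "aa1", "t1"],
}

# Speaker-independent PEL means (1-D features for simplicity; real
# models would hold full HMM output distributions).
pel_means = {"k1": 0.0, "ae2": 0.0, "aa1": 0.0, "t1": 0.0}

def adapt(word, observed_frames, rate=0.2):
    """Nudge each PEL's mean toward the frames aligned to it."""
    for pel, frame in zip(LEXICON[word], observed_frames):
        pel_means[pel] += rate * (frame - pel_means[pel])

# Adapting on "cat" also shifts "k1" and "t1", which "cot" shares,
# so "cot" benefits before the speaker has ever said it.
adapt("cat", [1.0, 2.0, 1.5])
```

Because a phoneme has at most 63 PELs while it may have thousands of PICs, a modest amount of dictation covers most of the shared units, which is why roughly a thousand recognized words suffice to adapt the bulk of the model inventory.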
