Abstract

For a large-vocabulary speech-recognition system, such as Dragon Systems' 30,000-word DragonDictate recognizer, an efficient approach to training is to use “phonemes-in-context” (PICs): triphones supplemented by a code describing prepausal lengthening. Each PIC is in turn represented by a sequence of one to six “phonetic elements” (PELs). For each phoneme there may be thousands of different PICs, but no more than 63 PELs. Initially, all PICs and PELs are trained from a database of about 16,000 tokens recorded by a reference speaker. When the recognizer is used by a new speaker, each word that is recognized is immediately used to adapt the PELs in its Markov models. After about a thousand words have been recognized, most PELs have been adapted to the new speaker, so that even the models for words that have not yet been spoken are appropriate for the new speaker. The recognizer was tested with two texts that differed greatly in vocabulary and style. Three speakers dictated each text: the reference speaker, a new male speaker, and a new female speaker. After adaptation on 1,500 words, performance for all three speakers was better than the performance of the reference speaker on unadapted models. With an active vocabulary of 25,000 words, the fraction of words recognized correctly was 86%, with an additional 8% appearing on a “choice list” of eight words.
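The key idea in the abstract — that adapting shared PELs improves models even for words the new speaker has never said — can be sketched as follows. This is a hypothetical illustration, not the paper's actual method: the lexicon, PEL names, one-dimensional features, and the exponential-moving-average update rule are all invented for clarity.

```python
# Sketch of shared-unit adaptation: each word's model is a sequence of
# PELs, and PEL parameters are shared across all words. Adapting the
# PELs aligned to one recognized word therefore also updates every
# other word that contains those PELs.

# Illustrative lexicon: word -> PEL sequence (names are invented).
LEXICON = {
    "cat": ["k1", "ae2", "t1"],
    "cot": ["k1", "aa1", "t1"],
}

# Speaker-independent PEL means (1-D features for simplicity; real
# models would hold full HMM output distributions).
pel_means = {"k1": 0.0, "ae2": 0.0, "aa1": 0.0, "t1": 0.0}

def adapt(word, observed_frames, rate=0.2):
    """Nudge each PEL's mean toward the frames aligned to it."""
    for pel, frame in zip(LEXICON[word], observed_frames):
        pel_means[pel] += rate * (frame - pel_means[pel])

# Adapting on "cat" also shifts "k1" and "t1", which "cot" shares,
# so "cot" benefits before the speaker has ever said it.
adapt("cat", [1.0, 2.0, 1.5])
```

Because a phoneme has at most 63 PELs while it may have thousands of PICs, a modest amount of dictation covers most of the shared units, which is why roughly a thousand recognized words suffice to adapt the bulk of the model inventory.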
