Abstract

A hot topic in speech recognition is the development of technology for the automatic transcription of telephone conversations. The recognizer must contain robust language, pronunciation, and acoustic models that embody the world and topic knowledge, and the understanding of syntax and pronunciation, that the talkers share and use in decoding each other's acoustic signals. Partly because of this shared knowledge and the casual, unprepared nature of the speech, the signals contain dysfluencies, incomplete and ungrammatical expressions, and "lazy," reduced articulation of words. Conversational speech recognition error rates, measured in the NIST Hub-5 evaluations, are 45% for English and 66% to 75% for Spanish, Mandarin, and Arabic. To improve this performance, the shared knowledge must be represented in a mathematical framework that facilitates efficient search over the sentences of a language to decode the speech. Recent work, including workshops at Rutgers CAIP and Johns Hopkins CLSP, has investigated, among other techniques, multistream processing, frequency warping, adaptation of pronunciation and acoustic models of phones, pronunciation modeling, syllable-based recognition, dysfluency and discourse-state language models, and link grammar parsing. This talk will review how knowledge is represented in the recognizer architecture, the search procedures used, and the results of the various investigations.
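As a sketch of the mathematical framework referred to here, most large-vocabulary recognizers cast decoding as a Bayes decision rule (the symbols W for a candidate word sequence and A for the observed acoustics are introduced only for illustration; they do not appear in the abstract itself):

\[
\hat{W} \;=\; \arg\max_{W} \, P(W \mid A) \;=\; \arg\max_{W} \, P(A \mid W)\, P(W),
\]

where the acoustic model supplies P(A | W), the language model supplies P(W), and the pronunciation model links words to phone sequences within P(A | W). The search procedures reviewed in the talk can be read as efficient ways of carrying out this maximization over the sentences of the language.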
