Abstract
This chapter surveys the main techniques used in today's automatic speech recognition (ASR) and text-to-speech (TTS) systems, with special emphasis on the underlying concepts, the requirements imposed by their implementation, and the resulting limitations. ASR is a major component of many spoken language systems. It enables useful human–machine interfaces as well as computer-mediated human-to-human communication. Statistical modeling paradigms and their extensions are the key approaches to ASR. Under suitable assumptions, they provide a means to factorize the different layers of spoken language structure, and several major components emerge from this factorization. First, feature extraction algorithms analyze the speech signal. The acoustic model then represents the knowledge needed to recognize the individual sounds of speech. Words are built as sequences of those individual sounds, which is represented in the pronunciation model. Finally, the language model represents the knowledge of how words are grouped to form sentences. ASR technology draws on a range of disciplines, including digital signal processing, probability theory, estimation theory, and information theory, as well as studies of speech production and perception and of the structure of spoken language.
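The factorization into acoustic, pronunciation, and language models is commonly written as a decoding rule of the following form. This is a sketch in standard notation, not taken from the chapter itself: the symbols X (the sequence of feature vectors produced by feature extraction), W (a candidate word sequence), and Q (a sequence of individual sound units) are assumptions introduced here for illustration.

```latex
% Assumed notation: X = feature vectors, W = word sequence, Q = sound-unit sequence.
% Decoding: choose the word sequence with the highest posterior probability.
\hat{W} = \arg\max_{W} P(W \mid X)
        = \arg\max_{W} \; p(X \mid W)\, P(W)

% Expanding over sound-unit sequences, and assuming the features depend
% on the words only through the sounds:
p(X \mid W) = \sum_{Q} p(X \mid Q, W)\, P(Q \mid W)
            \approx \sum_{Q} p(X \mid Q)\, P(Q \mid W)
```

Under these assumptions, p(X | Q) plays the role of the acoustic model, P(Q | W) the pronunciation model, and P(W) the language model. The second equality in the first line follows from Bayes' rule, since p(X) does not depend on W.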