Abstract
Audio-visual speech recognition refers to the automatic transcription of speech into text by exploiting information present in video of the speaker's mouth region, in addition to the traditionally used acoustic signal. The use of visual information in automatic speech recognition, also known as automatic speechreading or lipreading, is motivated by the bimodality of human speech production and perception, coupled with the fact that audio-only speech recognition is not robust in noisy acoustic environments. Audio-visual speech recognition systems significantly outperform their audio-only counterparts, especially when the visual conditions are good and the acoustic signal is degraded by noise. Incorporating visual information into speech recognition requires two new components: a visual front end, which detects the speaker's mouth region and extracts informative visual speech features from it, and a fusion mechanism, which integrates the visual features into the speech recognition process. The most commonly adopted designs of these components are discussed here.
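As a rough illustration of these two components, and not the specific designs discussed in the paper, the sketch below shows one widely used combination: a heuristic mouth-region crop with low-order 2-D DCT coefficients as visual features, followed by feature-level (concatenative) fusion with the audio features. All function names, crop proportions, and parameter values here are illustrative assumptions.

```python
# A minimal sketch of an AVSR front end and feature-level fusion,
# assuming OpenCV is available. Function names and the mouth-crop
# heuristic are illustrative, not taken from the paper.
import numpy as np
import cv2

def extract_visual_features(frame_gray, n_coeffs=30):
    """Detect the speaker's face, crop an approximate mouth region,
    and return low-order 2-D DCT coefficients as visual speech features."""
    face_cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    faces = face_cascade.detectMultiScale(frame_gray, 1.1, 5)
    if len(faces) == 0:
        return None  # no face found in this frame
    x, y, w, h = faces[0]
    # Heuristic assumption: the mouth occupies roughly the lower third
    # and central half of the detected face box.
    mouth = frame_gray[y + 2 * h // 3 : y + h, x + w // 4 : x + 3 * w // 4]
    mouth = cv2.resize(mouth, (32, 32)).astype(np.float32)
    coeffs = cv2.dct(mouth)
    # Keep the top-left (low-frequency) block of DCT coefficients,
    # which carries most of the mouth-shape information.
    return coeffs[:6, :5].flatten()[:n_coeffs]

def fuse_features(audio_feat, visual_feat):
    """Early (feature-level) fusion: concatenate synchronized audio and
    visual feature vectors into a single observation for the recognizer."""
    return np.concatenate([audio_feat, visual_feat])
```

In practice the fused vectors would feed a standard recognizer (e.g., an HMM or neural acoustic model); decision-level fusion, where separate audio and visual classifiers are combined, is the other common integration strategy.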