Abstract

Audio-visual speech recognition is the automatic transcription of speech into text that exploits information in video of the speaker's mouth region, in addition to the traditionally used acoustic signal. The use of visual information in automatic speech recognition, also known as automatic speechreading or lipreading, is motivated by the bimodality of human speech production and perception, and by the fact that audio-only speech recognition is not robust in noisy acoustic environments. Audio-visual systems can significantly outperform their audio-only counterparts, especially when visual conditions are good and the audio is noisy. Incorporating visual information into speech recognition requires two additional components: a visual front end, which locates the speaker's mouth region in the video and extracts informative visual speech features from it, and an integration stage, which combines the visual features with the acoustic information during recognition. The most commonly adopted designs for both components are discussed here.
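
To make the integration step concrete, the sketch below illustrates feature-level (early) fusion, one common integration scheme in which synchronized audio and visual feature vectors are concatenated frame by frame before recognition. This is a minimal, self-contained Python illustration: the function names, feature dimensionalities, and the random stand-in front ends are hypothetical and are not taken from this article.

import numpy as np

def extract_audio_features(waveform: np.ndarray, n_frames: int) -> np.ndarray:
    # Stand-in for an acoustic front end (e.g., 13 MFCCs per frame).
    # A real system would compute these from the waveform with a DSP library.
    rng = np.random.default_rng(0)
    return rng.standard_normal((n_frames, 13))

def extract_visual_features(mouth_rois: np.ndarray) -> np.ndarray:
    # Stand-in for a visual front end: flatten each detected mouth-region
    # image and project it to a low-dimensional vector, loosely mimicking
    # appearance-based (e.g., PCA- or DCT-style) visual features.
    flat = mouth_rois.reshape(len(mouth_rois), -1).astype(float)
    rng = np.random.default_rng(1)
    projection, _ = np.linalg.qr(rng.standard_normal((flat.shape[1], 10)))
    return flat @ projection

def fuse_features(audio_feats: np.ndarray, visual_feats: np.ndarray) -> np.ndarray:
    # Feature-level (early) fusion: concatenate the time-aligned audio and
    # visual vectors; the fused vectors would then feed a single recognizer.
    n = min(len(audio_feats), len(visual_feats))
    return np.concatenate([audio_feats[:n], visual_feats[:n]], axis=1)

# Usage: 100 synchronized audio frames and 32x32 mouth-region crops.
audio = extract_audio_features(np.zeros(16000), n_frames=100)
visual = extract_visual_features(np.zeros((100, 32, 32)))
fused = fuse_features(audio, visual)  # shape (100, 23)

The main alternative to this scheme is decision fusion, in which separate audio and visual recognizers are combined at the score or output level rather than at the feature level.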
