Abstract

We like to converse with other people using both sound and visuals, as our perception of speech is bimodal. Because both modalities convey essentially the same speech structure, we are able to integrate them and often understand the message better than with our eyes closed. In this work we aim to learn more about the visual nature of speech, known as lip-reading, and to exploit it to build better automatic speech recognition systems. Recent developments in machine learning, together with the release of suitable audio-visual datasets aimed at large-vocabulary continuous speech recognition, have revived interest in lip-reading and allow us to address the recurring question of how to better integrate visual and acoustic speech.
