Abstract

Lipreading is a form of human–computer interaction based on visual information. From the perspective of the pronunciation mechanism, the lips are only one of the vocal tract organs and cannot express the complete articulation process, so recognizing speech content solely from lip movements is very challenging. Since the shape of the vocal tract determines the final sound during articulation, we propose to improve lipreading accuracy by jointly exploiting lip features and reconstructed vocal tract features, and we call our method Modal Amplification Lipreading (MALip). We extend the U-Net model to learn vocal tract features from Mel-spectrogram features extracted from audio. Our model aims to reduce computational complexity while ensuring that the reconstructed vocal tract features are of good quality. We also introduce techniques that keep the vocal tract features effective and prevent them from being compromised by noise or invalid audio. In addition, to facilitate the study of incorporating vocal tract features, we recorded ICSLR, a large-scale sentence-level Chinese dataset, in an experimental environment, and verify for the first time the effectiveness of reconstructed audio features in improving lipreading accuracy. Through extensive experiments on ICSLR and the publicly available natural-sentence dataset CMLR, we demonstrate the effectiveness of our MALip method compared with state-of-the-art counterparts.
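To make the described pipeline concrete, the following is a minimal, illustrative sketch (not the authors' exact MALip architecture): a small U-Net-style encoder-decoder maps a Mel-spectrogram to a vocal-tract feature map, which is then fused with per-frame lip features by concatenation before a downstream sequence decoder. All module names, channel sizes, tensor shapes, and the concatenation-based fusion are assumptions made for illustration only.

```python
# Hypothetical sketch of "lip features + reconstructed vocal tract features".
# Not the paper's implementation; shapes and fusion strategy are assumed.
import torch
import torch.nn as nn


class ConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.net(x)


class SmallUNet(nn.Module):
    """U-Net-style encoder-decoder over a Mel-spectrogram of shape (B, 1, n_mels, T)."""

    def __init__(self, base=16):
        super().__init__()
        self.enc1 = ConvBlock(1, base)
        self.enc2 = ConvBlock(base, base * 2)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec1 = ConvBlock(base * 2, base)
        self.head = nn.Conv2d(base, 1, 1)  # reconstructed vocal-tract feature map

    def forward(self, mel):
        e1 = self.enc1(mel)                                   # (B, base, n_mels, T)
        e2 = self.enc2(self.pool(e1))                         # (B, 2*base, n_mels/2, T/2)
        d1 = self.dec1(torch.cat([self.up(e2), e1], dim=1))   # skip connection
        return self.head(d1)                                  # (B, 1, n_mels, T)


if __name__ == "__main__":
    # Toy usage with even-sized, hypothetical dimensions: 80 Mel bins, 100 frames.
    mel = torch.randn(2, 1, 80, 100)            # Mel-spectrogram input
    lip = torch.randn(2, 100, 256)              # hypothetical per-frame lip features
    vt = SmallUNet()(mel)                       # (2, 1, 80, 100)
    vt_seq = vt.squeeze(1).transpose(1, 2)      # (2, 100, 80): one vector per frame
    fused = torch.cat([lip, vt_seq], dim=-1)    # (2, 100, 336) -> to a sequence decoder
    print(fused.shape)
```

In this sketch the fusion is a simple per-frame concatenation; the actual method may combine the two modalities differently.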
