Abstract
Lipreading is a form of human–computer interaction based on visual information. From the standpoint of articulation, the lips are only one of the vocal tract organs and cannot convey the complete pronunciation process, so recognizing speech content from lip movements alone is very challenging. Because the shape of the vocal tract determines the final sound during articulation, we propose to improve lipreading accuracy by jointly exploiting lip features and reconstructed vocal tract features; we call our method Modal Amplification Lipreading (MALip). We extend the U-Net model to learn vocal tract features from Mel-spectrogram features extracted from audio. Our model aims to reduce computational complexity while ensuring that the reconstructed vocal tract features are of good quality. We also introduce techniques that keep the vocal tract features effective without being compromised by noise or invalid audio. In addition, to facilitate research that incorporates vocal tract features, we recorded ICSLR, a large sentence-level Chinese dataset, in an experimental environment, and we verify for the first time that reconstructed audio features improve lipreading accuracy. Through extensive experiments on ICSLR and the publicly available natural-sentence dataset CMLR, we demonstrate the effectiveness of MALip compared with state-of-the-art counterparts.
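The abstract does not specify the architecture in detail; the following is a minimal sketch, assuming a PyTorch setting, of the general idea of mapping a Mel-spectrogram to a same-sized vocal tract feature map with a small U-Net-style encoder-decoder. The class name `TinyUNet`, the layer widths, and the Mel-spectrogram parameters are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: a U-Net-style encoder-decoder that maps a
# Mel-spectrogram (batch, 1, n_mels, frames) to a feature map of the
# same shape. All sizes and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchaudio

class TinyUNet(nn.Module):
    def __init__(self, in_ch=1, base=16):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, base, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.enc2 = nn.Sequential(nn.Conv2d(base, base * 2, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec1 = nn.Sequential(nn.Conv2d(base * 2, base, 3, padding=1), nn.ReLU())
        self.out = nn.Conv2d(base, in_ch, 1)

    def forward(self, x):
        e1 = self.enc1(x)                        # full-resolution features
        e2 = self.enc2(self.down(e1))            # half-resolution features
        u = self.up(e2)                          # upsample back
        u = F.interpolate(u, size=e1.shape[-2:]) # match skip-connection size
        d = self.dec1(torch.cat([u, e1], dim=1)) # U-Net skip connection
        return self.out(d)

# Mel-spectrogram front end (sample rate, FFT, and hop sizes are assumptions).
mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_fft=400,
                                           hop_length=160, n_mels=64)
wave = torch.randn(1, 16000)                     # 1 s of dummy audio
spec = mel(wave).unsqueeze(0)                    # (batch, 1, n_mels, frames)
vocal_tract_feat = TinyUNet()(spec)              # reconstructed feature map
print(vocal_tract_feat.shape)
```

In a MALip-style pipeline, a feature map like this would then be fused with the visual lip features; the fusion step itself is not described in the abstract and is therefore omitted here.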