Abstract
Lip reading, also known as visual speech recognition (VSR), is the task of recognizing speech content using only the visual modality. Inspired by the natural synchronization between the acoustic speech signal and the speaker's facial movements during speech, some methods have begun to introduce the auditory modality to aid the training of lip reading models, especially by distilling knowledge from audio speech recognition (ASR) models into lip reading models. However, existing works usually overlook the domain gap between the audio and visual modalities, which greatly limits the ability of lip reading models to learn speech-related information from the audio modality and thus hinders further improvement on the VSR task. In this paper, we aim to establish a bridge between the audio and visual modalities so that the lip reading model can learn more effectively from the audio modality. Specifically, we introduce an audio-driven deformation flow to reflect the potential visual dynamics corresponding to the acoustic speech signal. The generated deformation flow is determined directly by the input acoustic speech signal, so it focuses on the facial dynamics corresponding to the speech signal rather than on unrelated visual conditions such as illumination, pose, and skin color. This property makes the flow-based model a more effective teacher for the lip reading task than the usual ASR models. Based on this idea, we propose an encoder-decoder architecture to generate the deformation flow, and we distill speech-related knowledge into the lip reading models from the deformation flow-based VSR model instead of directly from the ASR models. Finally, we evaluate our method on two popular large-scale lip reading datasets, LRW and LRS2-BBC. The results show that our method not only improves the lip reading model's performance without extra computation cost at test time, but also achieves higher performance than distilling directly from the ASR model, which demonstrates the advantages of the proposed deformation flow-based method.