Audio-Driven Deformation Flow for Effective Lip Reading

Dalu Feng, Xilin Chen, Shiguang Shan, Shuang Yang

doi:10.1109/icpr56361.2022.9956316

Abstract

Lip reading, also known as visual speech recognition (VSR), is the task of recognizing speech content using only the visual modality. Inspired by the natural synchronization between the acoustic speech signal and the speaker’s facial movements during speech, some methods have begun to introduce the auditory modality to aid the training of lip reading models, especially by distilling knowledge from audio speech recognition (ASR) models into lip reading models. However, existing works usually overlook the domain gap between the audio and visual modalities, which greatly limits the ability of lip reading models to learn speech-related information from the audio modality and thus hinders their improvement on the VSR task. In this paper, we aim to establish a bridge between the audio and visual modalities so that the lip reading model can learn more effectively from audio. Specifically, we introduce the audio-driven deformation flow to reflect the potential visual dynamics corresponding to the acoustic speech signal. The generated deformation flow is determined directly by the input acoustic speech signal, and so focuses on the facial dynamics corresponding to the speech signal rather than on unrelated visual conditions such as illumination, pose, and skin color. This property makes the flow-based model a more effective teacher than the usual ASR models for the lip reading task. Building on this idea, we propose an encoder-decoder architecture to generate the deformation flow, and we distill speech-related knowledge from the deformation-flow-based VSR model into the lip reading models, instead of from the ASR models directly. Finally, we evaluate our method on two popular large-scale lip reading datasets, LRW and LRS2-BBC. The results show that our method not only improves the lip reading model’s performance without extra computation cost at test time, but also achieves higher performance than distilling from the ASR model directly, which demonstrates the advantages of the proposed deformation-flow-based method.
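
The abstract does not state the exact training objective, so the following is only a minimal sketch of generic teacher-student distillation, assuming a standard temperature-softened KL term combined with a cross-entropy term on the ground-truth label. All names here (`distillation_loss`, `temperature`, `alpha`) are hypothetical and not taken from the paper. In the setting described above, `teacher_logits` would come from the deformation-flow-based VSR model and `student_logits` from the lip reading model.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of class logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability of the ground-truth class index."""
    return -math.log(max(softmax(logits)[label], 1e-12))


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / max(qi, 1e-12)) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a hard-label term and a soft teacher-matching term.

    Assumption: this is the generic distillation recipe, not necessarily
    the loss used in the paper.
    """
    hard = cross_entropy(student_logits, label)
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # The temperature**2 factor keeps the soft term's gradient scale
    # comparable to the hard term as the temperature changes.
    soft = kl_divergence(p_teacher, p_student) * temperature ** 2
    return (1 - alpha) * hard + alpha * soft


if __name__ == "__main__":
    student = [1.0, 0.5, -0.2]   # hypothetical word-class logits
    teacher = [2.0, 0.1, -1.0]
    print(distillation_loss(student, teacher, label=0))
```

Because the teacher is used only to compute this training loss, the student runs alone at inference, which is consistent with the abstract's claim of no extra computation cost at test time.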
