Abstract

Lipreading refers to understanding, and further translating into text, the speech of a speaker recorded on video. State-of-the-art lipreading methods excel at interpreting overlapped speakers, i.e., speakers who appear in both training and inference. However, generalizing these methods to unseen speakers incurs catastrophic performance degradation, due to the limited number of speakers in the training bank and the dominant visual variations caused by the shape/color of lips across different speakers. Relying merely on the visible changes of the lips therefore tends to overfit the model. To improve generalisation, in this paper we propose to use multi-modal features, i.e., visual and landmark features, to describe lip motion irrespective of speaker characteristics. The proposed sentence-level framework, dubbed LipFormer, is based on a visual-landmark transformer architecture in which a lip motion stream, a facial landmark stream, and a cross-modal fusion module are interconnected. More specifically, the two-stream embeddings produced by self-attention are fed into a cross-attention module to achieve alignment between the visual and landmark modalities. The resulting fused features are decoded into linguistic text by a cascaded sequence-to-sequence model. Extensive experiments demonstrate that our method generalises well to unseen speakers on multiple datasets.
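To make the described two-stream fusion concrete, the following is a minimal sketch in PyTorch, assuming per-frame visual and landmark embeddings of a hypothetical dimension; the module names, layer sizes, and use of `nn.TransformerEncoderLayer`/`nn.MultiheadAttention` are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of two-stream self-attention followed by cross-attention
# fusion, as described in the abstract. Not the LipFormer reference code.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Encodes visual and landmark streams, then fuses them with cross-attention."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        # Self-attention encoder for each stream (lip motion and facial landmarks).
        self.visual_encoder = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.landmark_encoder = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # Cross-attention: visual queries attend to landmark keys/values.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, visual: torch.Tensor, landmark: torch.Tensor) -> torch.Tensor:
        # visual, landmark: (batch, frames, d_model) per-frame embeddings.
        v = self.visual_encoder(visual)
        m = self.landmark_encoder(landmark)
        # Align the two modalities; the fused sequence would feed a seq2seq decoder.
        fused, _ = self.cross_attn(query=v, key=m, value=m)
        return self.norm(v + fused)


if __name__ == "__main__":
    visual_feats = torch.randn(2, 75, 256)    # toy visual features
    landmark_feats = torch.randn(2, 75, 256)  # toy landmark features
    out = CrossModalFusion()(visual_feats, landmark_feats)
    print(out.shape)  # torch.Size([2, 75, 256])
```

In this sketch, the fused representation keeps the visual stream as the query so that speaker-invariant landmark cues modulate the appearance features before sequence-to-sequence decoding.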
