Abstract

Lipreading refers to understanding, and further translating into text, the speech of a speaker recorded on video. State-of-the-art lipreading methods excel at interpreting overlapped speakers, i.e., speakers who appear in both training and inference. However, generalizing these methods to unseen speakers incurs catastrophic performance degradation, due to the limited number of speakers in the training bank and the dominant visual variations caused by the shape/color of lips across different speakers. Relying merely on the visible changes of the lips therefore tends to overfit the model. To improve generalisation, in this paper we propose to use multi-modal features, i.e., visual and landmark features, to describe lip motion irrespective of speaker characteristics. The proposed sentence-level framework, dubbed LipFormer, is based on a visual-landmark transformer architecture in which a lip motion stream, a facial landmark stream, and a cross-modal fusion module are interconnected. More specifically, the two-stream embeddings produced by self-attention are fed into a cross-attention module to achieve alignment between the visual and landmark modalities. The resulting fused features are decoded into linguistic text by a cascaded sequence-to-sequence model. Extensive experiments demonstrate that our method generalises well to unseen speakers on multiple datasets.
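To make the described two-stream fusion concrete, the following is a minimal sketch in PyTorch, assuming per-frame visual and landmark embeddings of a hypothetical dimension; the module names, layer sizes, and use of `nn.TransformerEncoderLayer`/`nn.MultiheadAttention` are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of two-stream self-attention followed by cross-attention
# fusion, as described in the abstract. Not the LipFormer reference code.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Encodes visual and landmark streams, then fuses them with cross-attention."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        # Self-attention encoder for each stream (lip motion and facial landmarks).
        self.visual_encoder = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.landmark_encoder = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # Cross-attention: visual queries attend to landmark keys/values.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, visual: torch.Tensor, landmark: torch.Tensor) -> torch.Tensor:
        # visual, landmark: (batch, frames, d_model) per-frame embeddings.
        v = self.visual_encoder(visual)
        m = self.landmark_encoder(landmark)
        # Align the two modalities; the fused sequence would feed a seq2seq decoder.
        fused, _ = self.cross_attn(query=v, key=m, value=m)
        return self.norm(v + fused)


if __name__ == "__main__":
    visual_feats = torch.randn(2, 75, 256)    # toy visual features
    landmark_feats = torch.randn(2, 75, 256)  # toy landmark features
    out = CrossModalFusion()(visual_feats, landmark_feats)
    print(out.shape)  # torch.Size([2, 75, 256])
```

In this sketch, the fused representation keeps the visual stream as the query so that speaker-invariant landmark cues modulate the appearance features before sequence-to-sequence decoding.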
