Abstract
Lipreading refers to understanding and translating the speech of a speaker in a video into textual output. State-of-the-art lipreading methods excel at interpreting overlapped speakers, i.e., speakers who appear in both training and inference. However, generalizing these methods to unseen speakers incurs catastrophic performance degradation, owing to the limited number of speakers in the training bank and the dominant visual variations caused by the shape/color of lips across different speakers. Relying solely on the visible changes of the lips therefore tends to overfit the model. To improve generalization, in this paper we propose to use multi-modal features, i.e., visual and landmark, to describe lip motion while remaining agnostic to speaker characteristics. The proposed sentence-level framework, dubbed LipFormer, is based on a visual-landmark transformer architecture in which a lip motion stream, a facial landmark stream, and a cross-modal fusion module are interconnected. More specifically, the two-stream embeddings produced by self-attention are fed into a cross-attention module to achieve alignment between visual and landmark variations. The resulting fused features are decoded into linguistic text by a cascaded sequence-to-sequence translation. Extensive experiments demonstrate that our method generalizes well to unseen speakers on multiple datasets.
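To make the described fusion step concrete, the following is a minimal sketch of how two-stream embeddings (visual and landmark) can be combined through self-attention followed by cross-attention. It is an illustrative assumption written in PyTorch, not the authors' LipFormer implementation; the class name, dimensions, and wiring are hypothetical.

# Minimal sketch of cross-modal fusion: visual queries attend to landmark keys/values.
# All names and hyperparameters are illustrative assumptions, not the LipFormer code.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Fuse a visual (lip-motion) stream with a facial-landmark stream."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        # Per-stream self-attention produces modality-specific embeddings.
        self.visual_self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.landmark_self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Cross-attention aligns the two streams across time.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, visual: torch.Tensor, landmark: torch.Tensor) -> torch.Tensor:
        # visual, landmark: (batch, frames, d_model) frame-level features.
        v, _ = self.visual_self_attn(visual, visual, visual)
        l, _ = self.landmark_self_attn(landmark, landmark, landmark)
        # Visual stream queries the landmark stream; residual keeps temporal structure.
        fused, _ = self.cross_attn(query=v, key=l, value=l)
        return self.norm(v + fused)


if __name__ == "__main__":
    B, T, D = 2, 75, 256  # batch size, frame count, feature dim (illustrative)
    fusion = CrossModalFusion(d_model=D)
    out = fusion(torch.randn(B, T, D), torch.randn(B, T, D))
    print(out.shape)  # torch.Size([2, 75, 256])

In such a design, the fused features would then be passed to a sequence-to-sequence decoder to produce the output text, as the abstract describes.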