Abstract

We observe that for lip reading, language is transformed only locally rather than globally, i.e., speaking and writing follow the same basic grammar rules. In this work, we present a cross-modal language model to tackle the lip-reading challenge on silent videos. Compared to previous works, we consider multi-motion-informed contexts, composed of multiple lip-motion representations from different subspaces, to guide decoding via the source-target attention mechanism. We present a piece-wise pre-training strategy inspired by multi-task learning: a visual module is pre-trained to generate multi-motion-informed contexts for cross-modality, and a decoder is pre-trained to generate texts for language modeling. Our final large-scale model outperforms baseline models on four datasets: LRS2, LRS3, LRW, and GRID. We will release our source code on GitHub.
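To make the described decoding scheme concrete, the following is a minimal sketch (not the authors' implementation) of how multiple lip-motion context streams could guide a text decoder through source-target (cross) attention. The class name `MultiContextDecoderLayer`, the tensor shapes, and the choice to concatenate the motion contexts along the time axis before attending are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch: a decoder layer whose cross attention reads from
# several lip-motion representation streams (one per subspace).
import torch
import torch.nn as nn


class MultiContextDecoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, text_states, motion_contexts):
        # motion_contexts: list of (batch, time, d_model) tensors, one per
        # lip-motion subspace. Here they are simply concatenated along the
        # time axis to form the keys/values of source-target attention
        # (an assumption made for brevity; causal masking is also omitted).
        memory = torch.cat(motion_contexts, dim=1)
        x = self.norm1(text_states + self.self_attn(text_states, text_states, text_states)[0])
        x = self.norm2(x + self.cross_attn(x, memory, memory)[0])
        return self.norm3(x + self.ffn(x))


# Usage: two hypothetical motion streams guiding 20 decoder positions.
layer = MultiContextDecoderLayer()
text = torch.randn(2, 20, 512)
contexts = [torch.randn(2, 75, 512), torch.randn(2, 75, 512)]
out = layer(text, contexts)  # shape: (2, 20, 512)
```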
