Abstract

We observe that for lip reading, language is transformed only locally rather than globally, i.e., speaking and writing follow the same basic grammar rules. In this work, we present a cross-modal language model to tackle the lip-reading challenge on silent videos. Compared to previous works, we consider multi-motion-informed contexts, composed of multiple lip-motion representations from different subspaces, to guide decoding via the source-target attention mechanism. We present a piece-wise pre-training strategy inspired by multi-task learning: a visual module is pre-trained to generate multi-motion-informed contexts for cross-modality, and a decoder is pre-trained to generate texts for language modeling. Our final large-scale model outperforms baseline models on four datasets: LRS2, LRS3, LRW, and GRID. We will release our source code on GitHub.
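To make the described decoding scheme concrete, the following is a minimal sketch (not the authors' implementation) of how multiple lip-motion context streams could guide a text decoder through source-target (cross) attention. The class name `MultiContextDecoderLayer`, the tensor shapes, and the choice to concatenate the motion contexts along the time axis before attending are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch: a decoder layer whose cross attention reads from
# several lip-motion representation streams (one per subspace).
import torch
import torch.nn as nn


class MultiContextDecoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, text_states, motion_contexts):
        # motion_contexts: list of (batch, time, d_model) tensors, one per
        # lip-motion subspace. Here they are simply concatenated along the
        # time axis to form the keys/values of source-target attention
        # (an assumption made for brevity; causal masking is also omitted).
        memory = torch.cat(motion_contexts, dim=1)
        x = self.norm1(text_states + self.self_attn(text_states, text_states, text_states)[0])
        x = self.norm2(x + self.cross_attn(x, memory, memory)[0])
        return self.norm3(x + self.ffn(x))


# Usage: two hypothetical motion streams guiding 20 decoder positions.
layer = MultiContextDecoderLayer()
text = torch.randn(2, 20, 512)
contexts = [torch.randn(2, 75, 512), torch.randn(2, 75, 512)]
out = layer(text, contexts)  # shape: (2, 20, 512)
```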
