Abstract
Lip reading aims to recognize text from a talking face without audio information. Recently, several works have focused on how to effectively extract spatial and temporal information. We introduce a novel two-stream network that makes full use of the complementarity of global and local spatial information. The global spatial information is generated directly by the global stream. In the local stream, we design a patch selection module that uses an attention mechanism to select the critical local information. The fused features of the two streams, together with the global features, are then fed into a temporal module to further explore temporal clues. To guide the selection of local information from the fused features, and to make the global and local streams learn from each other, we design a global information guide loss and a mutual learning loss, respectively. Finally, extensive experiments on both the LRW and CAS-VSR-W1K datasets demonstrate the superiority of our two-stream network.
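The abstract does not include code; the following is a minimal PyTorch sketch of how such a two-stream design with an attention-based patch selection module and a mutual learning loss might be wired together. All module names, feature dimensions, the pooling/fusion choices, and the symmetric-KL formulation of the mutual learning loss are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a two-stream lip-reading model (not the authors' code).
# Assumed inputs: per-frame global features (B, T, D) and local patch
# features (B, T, N, D) produced by some visual backbone.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchSelection(nn.Module):
    """Scores local patches with attention and keeps the top-k critical ones."""
    def __init__(self, dim, k=4):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # saliency score per patch (assumed form)
        self.k = k

    def forward(self, patches):  # patches: (B*T, N, dim)
        w = self.score(patches).squeeze(-1)           # (B*T, N) attention scores
        idx = w.topk(self.k, dim=1).indices           # indices of critical patches
        sel = torch.gather(
            patches, 1, idx.unsqueeze(-1).expand(-1, -1, patches.size(-1)))
        return sel.mean(dim=1)                        # pooled local feature (B*T, dim)

class TwoStreamLipReader(nn.Module):
    def __init__(self, dim=256, num_classes=500):
        super().__init__()
        self.global_stream = nn.Linear(dim, dim)      # stand-in for a CNN branch
        self.local_stream = nn.Linear(dim, dim)
        self.select = PatchSelection(dim)
        # Temporal module over fused (global + local) features; GRU is an assumption.
        self.temporal = nn.GRU(2 * dim, dim, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, global_feat, patch_feat):
        # global_feat: (B, T, D); patch_feat: (B, T, N, D)
        g = self.global_stream(global_feat)
        B, T, N, D = patch_feat.shape
        l = self.select(self.local_stream(patch_feat).reshape(B * T, N, D))
        l = l.reshape(B, T, D)
        fused = torch.cat([g, l], dim=-1)             # fuse the two streams
        out, _ = self.temporal(fused)
        return self.head(out.mean(dim=1)), g, l

def mutual_learning_loss(g, l):
    """Symmetric KL so the two streams learn from each other (assumed form)."""
    pg, pl = F.log_softmax(g, dim=-1), F.log_softmax(l, dim=-1)
    return 0.5 * (F.kl_div(pg, pl.exp(), reduction='batchmean')
                  + F.kl_div(pl, pg.exp(), reduction='batchmean'))
```

In this sketch the global stream's output is kept alongside the fused features, mirroring the abstract's description of feeding both into the temporal module; the global information guide loss would similarly be computed between the global features and the fused representation.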