Abstract

Lip reading aims to recognize text from a talking face without audio information. Owing to the rapid development of deep learning techniques, researchers have made major breakthroughs in both word-level and sentence-level English lip reading in recent years. Unlike English, Chinese is a tonal language, which makes lexical meanings harder to distinguish. In addition, most existing Chinese lip reading datasets are designed for Mandarin, and few exist for Cantonese. In this paper, we propose a word-level Cantonese lip reading dataset called CLRW, which contains 800 word classes with 400,000 samples. To better support practical applications, we place no restrictions on gender, age, posture, lighting conditions, or speech speed, so that CLRW is closer to the real-world distribution. We first give a detailed description of the data collection process. Next, we propose a novel two-branch network named TBGL, which consists of a global branch and a local branch. The global branch models the whole lip, while the local branch divides the feature into three parts to focus on subtle local lip motion. We train the two branches jointly and achieve comparable performance on LRW, CAS-VSR-W1K, and CLRW. Finally, we benchmark our dataset and perform a comprehensive analysis of the results, which demonstrates that CLRW is challenging and that it will have a positive impact on future Cantonese lip reading tasks.
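
The global/local split described above can be illustrated with a minimal sketch. It is not the TBGL implementation: the abstract does not state along which axis the feature is divided or how each part is processed, so the split along the width, the plain averaging in place of learned layers, and the names `global_pool` and `local_pool` are all assumptions made for illustration.

```python
from statistics import fmean


def global_pool(feature):
    """Global view: average every value of an H x W feature map."""
    return fmean(v for row in feature for v in row)


def local_pool(feature, parts=3):
    """Local view: split the map into `parts` strips along the width
    (an assumed axis) and average each strip separately."""
    width = len(feature[0])
    bounds = [round(i * width / parts) for i in range(parts + 1)]
    return [
        fmean(v for row in feature for v in row[lo:hi])
        for lo, hi in zip(bounds, bounds[1:])
    ]


# Toy 2 x 6 "feature map" standing in for one frame of lip features.
feature = [
    [1.0, 1.0, 4.0, 4.0, 7.0, 7.0],
    [1.0, 1.0, 4.0, 4.0, 7.0, 7.0],
]

global_feat = global_pool(feature)   # one summary of the whole lip region
local_feats = local_pool(feature)    # three summaries, one per part
```

In this toy input the global summary blends all three regions into a single value, while the local summaries keep them apart, which is the intuition behind pairing a whole-lip branch with a part-based one.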
