Abstract
Automatic lip reading has advanced significantly in recent years. However, lip reading methods need large-scale datasets, which are scarce for many low-resource languages. In this paper, we introduce a new multipurpose audio-visual dataset for Persian. The dataset contains approximately 220 h of video from 1760 speakers and can be used for multiple tasks, such as lip reading, automatic speech recognition, audio-visual speech recognition, and speaker recognition. It is also the first large-scale lip reading dataset for Persian. We provide a baseline method for each task and propose a technique to identify visemes (visual units of speech) in Persian. The visemes obtained by this technique yield a 7% relative improvement in lip reading accuracy over previously proposed visemes, and the technique can be generalized to other languages as well.