Automatic lip reading has advanced significantly in recent years, but these methods require large-scale datasets, which are scarce for many low-resource languages. In this paper, we introduce a new multipurpose audio-visual dataset for Persian. The dataset contains approximately 220 hours of video from 1,760 speakers and supports multiple tasks, including lip reading, automatic speech recognition, audio-visual speech recognition, and speaker recognition. It is also the first large-scale lip reading dataset for this language. We provide a baseline method for each task and propose a technique for identifying visemes (visual units of speech) in Persian. The visemes obtained with this technique improve lip reading accuracy by a relative 7% over previously proposed visemes, and the technique can be generalized to other languages.