Abstract

Automatic lip reading has advanced significantly in recent years. However, these methods require large-scale datasets, which are scarce for many low-resource languages. In this paper, we introduce a new multipurpose audio-visual dataset for Persian. The dataset contains approximately 220 hours of video from 1760 speakers and can be used for multiple tasks, such as lip reading, automatic speech recognition, audio-visual speech recognition, and speaker recognition. It is also the first large-scale lip reading dataset for this language. We provide a baseline method for each task and propose a technique for identifying visemes (visual units of speech) in Persian. The visemes obtained with this technique improve lip reading accuracy by a relative 7% over previously proposed visemes, and the technique can be generalized to other languages as well.