Abstract

The visual cues obtained from the face and mouth region of a speaker provide valuable information for speech perception. The idea of audio-visual speech recognition is to combine visual information with acoustic speech signals to enhance the intelligibility of speech in the presence of ambient noise. In audio-visual speech recognition, lip image sequences of speakers are used along with acoustic signals to convert speech into text. Researchers are exploring ways to improve the performance of audio-visual speech recognition and to address real-life problems such as designing voice-dialling systems and highly secure biometric authentication systems. This work presents a review of the latest research findings on audio-visual automatic speech recognition using traditional machine learning, neural networks, and other deep learning techniques. The paper identifies future research opportunities through a comparative analysis of the techniques used in the literature for the different stages of audio-visual speech recognition: region-of-interest detection, audio and visual speech feature extraction, and fusion of the modalities.
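As an illustration of the fusion stage mentioned above, the following is a minimal sketch (not taken from any specific paper in the review) of early, feature-level fusion, where frame-aligned audio and visual feature vectors are simply concatenated before being passed to a recognizer. The feature dimensions and names are illustrative assumptions.

```python
# Minimal sketch of early (feature-level) fusion for audio-visual speech
# recognition: per-frame concatenation of the two modalities' features.
# Dimensions are illustrative (e.g. MFCCs for audio, lip-ROI descriptors
# for video); real systems must first time-align the two streams.

def fuse_features(audio_frames, visual_frames):
    """Concatenate per-frame audio and visual feature vectors.

    Both inputs are lists of equal length; each element is a list of
    floats representing one frame's features for that modality.
    """
    if len(audio_frames) != len(visual_frames):
        raise ValueError("modalities must be frame-aligned")
    return [a + v for a, v in zip(audio_frames, visual_frames)]

# Toy example: 2 frames, 3-dim audio + 2-dim visual -> 5-dim fused vectors
audio = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
visual = [[1.0, 2.0], [3.0, 4.0]]
fused = fuse_features(audio, visual)
print(len(fused), len(fused[0]))  # -> 2 5
```

Late (decision-level) fusion, by contrast, would run separate audio and visual recognizers and combine their output scores; both strategies appear throughout the surveyed literature.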
