Abstract

The objective of Visual Speech Recognition (VSR) is to understand what an individual is saying from the visual deformations of the lips and surrounding facial region. Significant limitations of existing solutions include the dearth of training data, the absence of properly deployed end-to-end systems, the lack of holistic feature representations, and low accuracy. To address these limitations, this study proposes a novel, scalable, and robust VSR system that uses a video recording of the user to determine the word being spoken. To this end, a customized 3-Dimensional Convolutional Neural Network (3D CNN) architecture is proposed that extracts spatio-temporal features and maps them to prediction probabilities over the words in the corpus. We created a customized dataset resembling the metadata of the MIRACL-VC1 dataset to validate the concept of person-independence. While remaining robust to a broad spectrum of lighting conditions across multiple devices, our model achieves a training accuracy of 80.2% and a testing accuracy of 77.9% in predicting the word spoken by the user.
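The abstract does not include an implementation, so the following is only a minimal sketch of the kind of 3D CNN word classifier it describes, written in PyTorch. Every detail here (the layer sizes, the clip shape, the class name `Lip3DCNN`, and the 10-word vocabulary, chosen to match the size of the MIRACL-VC1 word set) is an illustrative assumption, not the authors' actual architecture.

```python
# Illustrative sketch only: the paper's architecture is not specified in the
# abstract, so all layer sizes, shapes, and names below are assumptions.
import torch
import torch.nn as nn


class Lip3DCNN(nn.Module):
    """A small 3D CNN mapping a mouth-region video clip to per-word
    prediction probabilities over a fixed corpus (hypothetical sizes)."""

    def __init__(self, num_words: int = 10):  # e.g. a 10-word MIRACL-VC1-style corpus
        super().__init__()
        self.features = nn.Sequential(
            # 3D convolutions learn spatio-temporal features jointly:
            # each kernel spans (time, height, width).
            nn.Conv3d(1, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=(3, 3, 3), padding=1),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(2, 2, 2)),
            nn.AdaptiveAvgPool3d(1),  # collapse remaining T x H x W to 1x1x1
        )
        self.classifier = nn.Linear(64, num_words)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels=1, frames, height, width)
        z = self.features(x).flatten(1)
        return self.classifier(z)  # logits; softmax yields word probabilities


# Usage with a dummy batch of two 16-frame, 64x64 grayscale clips:
model = Lip3DCNN(num_words=10)
clip = torch.randn(2, 1, 16, 64, 64)
probs = model(clip).softmax(dim=-1)  # prediction probabilities over the corpus
print(probs.shape)                   # torch.Size([2, 10])
```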
