Abstract

The objective of Visual Speech Recognition (VSR) is to understand what an individual is saying from the visual deformations of the mouth region. Significant limitations of existing solutions include a dearth of training data, the absence of properly deployed end-to-end solutions, a lack of holistic feature representation, and low accuracy. To address these limitations, this study proposes a novel, scalable, and robust VSR system that uses video of the user to determine the word being spoken. A customized 3-Dimensional Convolutional Neural Network (3D CNN) architecture is proposed that extracts spatio-temporal features and maps them to prediction probabilities over the words in the corpus. To validate person-independence, we created a customized dataset that mirrors the metadata of the MIRACL-VC1 dataset. While remaining robust to a broad spectrum of lighting conditions across multiple devices, our model achieves a training accuracy of 80.2% and a testing accuracy of 77.9% in predicting the word spoken by the user.
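For intuition, the following is a minimal sketch of such an architecture in PyTorch, assuming an illustrative configuration: 16-frame RGB mouth-region clips, two 3D-convolutional stages, and a 10-word vocabulary (the size of the MIRACL-VC1 word set). The layer widths, kernel sizes, and class names are placeholders for exposition, not the paper's reported design.

```python
# Illustrative 3D CNN for word-level visual speech recognition.
# All hyperparameters here are assumptions, not the paper's configuration.
import torch
import torch.nn as nn

class VSR3DCNN(nn.Module):
    def __init__(self, num_words: int = 10):
        super().__init__()
        # 3D convolutions slide jointly over the time axis and the two
        # spatial axes, extracting spatio-temporal features of lip motion.
        self.features = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=(3, 3, 3), padding=1),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(2, 2, 2)),
        )
        # Pool the feature volume and map it to one score per corpus word.
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
            nn.Linear(64, num_words),
        )

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, channels, frames, height, width)
        logits = self.classifier(self.features(clips))
        # Softmax yields the prediction probabilities over the corpus.
        return torch.softmax(logits, dim=-1)

if __name__ == "__main__":
    model = VSR3DCNN(num_words=10)
    clip = torch.randn(1, 3, 16, 64, 64)  # one 16-frame RGB mouth clip
    print(model(clip).shape)              # torch.Size([1, 10])
```

In practice the softmax would typically be folded into a cross-entropy loss over the raw logits during training; it is kept explicit here only to mirror the abstract's description of mapping extracted features to prediction probabilities over the words in the corpus.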
