Abstract

Lip reading offers a compelling avenue for advancing speech recognition, leveraging visual cues from lip movements to recognize spoken words. This paper introduces a method that employs deep neural networks to convert lip motion into text. The approach integrates convolutional neural networks for visual feature extraction, recurrent neural networks for capturing temporal context, and the Connectionist Temporal Classification (CTC) loss to align frame-level lip features with the target phoneme sequence. Dynamic learning rate scheduling and a custom callback for training visualization are also incorporated into the training process. After training on a sizeable dataset, the model converges reliably and captures intricate temporal correlations. Comprehensive evaluation, combining quantitative metrics with qualitative assessment, validates the model's effectiveness: visual inspection of its lip reading output and standard speech recognition metrics both highlight its performance. The study also examines how different model topologies and hyperparameters affect performance, providing insights for future research. This work contributes a deep learning framework for accurate and efficient visual speech recognition and opens paths for further refinement and deployment across domains such as assistive technologies, audio-visual communication systems, and human-computer interaction.

Keywords: lip reading, deep learning, convolutional neural networks, recurrent neural networks, CTC loss, speech recognition
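The abstract describes a CNN + RNN + CTC pipeline with learning-rate scheduling; the following is a minimal sketch of such an architecture in TensorFlow/Keras, not the authors' implementation. The input dimensions, vocabulary size, layer widths, and the learning-rate schedule are illustrative assumptions.

```python
# Minimal sketch of a CNN + BiLSTM + CTC lip-reading model (illustrative only).
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_lipreading_model(frames=75, height=46, width=140, channels=1, vocab_size=40):
    # Input: a clip of cropped lip frames (time, height, width, channels).
    inp = layers.Input(shape=(frames, height, width, channels))
    # 3D convolutions extract spatiotemporal visual features.
    x = layers.Conv3D(32, 3, padding="same", activation="relu")(inp)
    x = layers.MaxPool3D((1, 2, 2))(x)
    x = layers.Conv3D(64, 3, padding="same", activation="relu")(x)
    x = layers.MaxPool3D((1, 2, 2))(x)
    # Collapse spatial dimensions to one feature vector per frame.
    x = layers.TimeDistributed(layers.Flatten())(x)
    # Bidirectional LSTM captures temporal context across frames.
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    # Per-frame class probabilities; the extra unit is the CTC blank label.
    out = layers.Dense(vocab_size + 1, activation="softmax")(x)
    return Model(inp, out)

def ctc_loss(y_true, y_pred):
    # CTC aligns frame-level predictions with the shorter target label sequence.
    batch = tf.shape(y_pred)[0]
    input_len = tf.fill((batch, 1), tf.shape(y_pred)[1])
    label_len = tf.fill((batch, 1), tf.shape(y_true)[1])
    return tf.keras.backend.ctc_batch_cost(y_true, y_pred, input_len, label_len)

model = build_lipreading_model()
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss=ctc_loss)

# Dynamic learning-rate scheduling via a standard Keras callback (example schedule).
lr_cb = tf.keras.callbacks.LearningRateScheduler(
    lambda epoch, lr: lr if epoch < 30 else lr * 0.9)
```

A training-visualization callback, as mentioned in the abstract, could be added alongside lr_cb by subclassing tf.keras.callbacks.Callback and decoding a sample prediction at the end of each epoch.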
