Visual Speech Recognition (VSR) is an active area of computer vision research, attracting interest for its ability to analyze lip movements and convert them into text. VSR systems leverage visual features to augment automatic speech understanding and predict text. VSR has various applications, including enhancing speech recognition when acoustic signals are degraded, aiding individuals with hearing impairments, bolstering security by reducing reliance on text-based passwords, facilitating biometric authentication through liveness detection, and enabling underwater communication. Despite the many techniques proposed to improve the robustness and accuracy of automatic speech recognition, VSR still faces challenges: homophones, gradient issues arising from varying sequence lengths, and the need to account for both short- and long-range correlations between consecutive video frames. We propose a hybrid network (HNet) built on a multilayered three-dimensional dilated convolutional neural network (3D-CNN). The dilated 3D-CNN extracts spatio-temporal features. HNet integrates two bidirectional recurrent neural networks (BiGRU and BiLSTM) that process the feature sequences in both directions to model temporal relationships. Fusing BiGRU and BiLSTM capabilities allows the model to process feature sequences more comprehensively and effectively. The proposed work focuses on face-based biometric authentication with liveness detection, using the VSR model to strengthen security against face spoofing. Existing face-based biometric systems are widely used for individual authentication and verification but remain vulnerable to 3D masks and adversarial attacks. The VSR system can be added to an existing face-based verification system as a second-level authentication step that verifies a person's liveness. The VSR system operates on a challenge-response principle: the user silently pronounces a passcode displayed on the screen. The model's effectiveness is assessed with word error rate (WER), which measures how closely the pronounced passcode matches the one presented on the screen. Overall, the proposed work aims to improve the accuracy of VSR so that it can be combined with existing face-based authentication systems. The proposed system outperforms existing VSR systems, achieving a 1.3% WER. The significance of the proposed hybrid model is that it efficiently captures temporal dependencies, enhances context embedding, improves robustness to input variability, reduces information loss, and improves performance and accuracy in modeling and analyzing passcode pronunciation patterns.
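For concreteness, the following is a minimal sketch of the architecture the abstract describes: a multilayered dilated 3D-CNN front-end for spatio-temporal feature extraction, followed by stacked BiGRU and BiLSTM layers. All hyperparameters here (channel counts, kernel sizes, dilation rates, hidden size, vocabulary size) are illustrative assumptions; the paper's actual values are not stated in the abstract.

```python
import torch
import torch.nn as nn

class HNetSketch(nn.Module):
    """Hedged sketch of HNet as described in the abstract: a dilated 3D-CNN
    front-end feeding BiGRU and BiLSTM layers. Layer sizes and dilation
    rates are illustrative assumptions, not values from the paper."""

    def __init__(self, vocab_size=28, hidden=256):
        super().__init__()
        # Multilayered dilated 3D convolutions over (T, H, W) lip-region clips;
        # increasing dilation widens the temporal/spatial receptive field.
        self.frontend = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1, dilation=1),
            nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=3, padding=2, dilation=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 4, 4)),  # keep the time axis, pool space
        )
        feat = 64 * 4 * 4
        # Bidirectional recurrent layers model short- and long-range
        # dependencies across frames; BiGRU output feeds the BiLSTM.
        self.bigru = nn.GRU(feat, hidden, bidirectional=True, batch_first=True)
        self.bilstm = nn.LSTM(2 * hidden, hidden, bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, vocab_size)  # e.g. per-frame CTC logits

    def forward(self, clips):  # clips: (B, 3, T, H, W)
        x = self.frontend(clips)                 # (B, 64, T, 4, 4)
        x = x.permute(0, 2, 1, 3, 4).flatten(2)  # (B, T, 64*4*4)
        x, _ = self.bigru(x)
        x, _ = self.bilstm(x)
        return self.classifier(x)                # (B, T, vocab_size)
```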
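The evaluation metric, WER, is the word-level edit distance between the reference passcode and the decoded hypothesis, normalized by the reference length: WER = (S + D + I) / N, where S, D, and I count substitutions, deletions, and insertions, and N is the number of reference words. Below is a sketch of how the challenge-response check might compare the decoded passcode against the one shown on screen; the example strings and exact-match acceptance rule are hypothetical, not from the paper.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein distance:
    (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit-distance table of size (len(ref)+1) x (len(hyp)+1).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Hypothetical liveness check: accept only if the silently pronounced
# passcode decoded by the VSR model matches the displayed one exactly.
displayed = "open sesame now"
decoded = "open sesame now"
assert wer(displayed, decoded) == 0.0
```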