Abstract

In visual speech recognition (VSR), speech is transcribed using only visual information to interpret tongue and teeth movements. Recently, deep learning has shown outstanding performance in VSR, with accuracy exceeding that of lipreaders on benchmark datasets. However, several problems still exist when using VSR systems. A major challenge is the distinction of words with similar pronunciation, called homophones; these lead to word ambiguity. Another technical limitation of traditional VSR systems is that visual information does not provide sufficient data for learning words such as “a”, “an”, “eight”, and “bin” because their lengths are shorter than 0.02 s. This report proposes a novel lipreading architecture that combines three different convolutional neural networks (CNNs; a 3D CNN, a densely connected 3D CNN, and a multi-layer feature fusion 3D CNN), which are followed by a two-layer bi-directional gated recurrent unit. The entire network was trained using connectionist temporal classification. The results of the standard automatic speech recognition evaluation metrics show that the proposed architecture reduced the character and word error rates of the baseline model by 5.681% and 11.282%, respectively, for the unseen-speaker dataset. Our proposed architecture exhibits improved performance even when visual ambiguity arises, thereby increasing VSR reliability for practical applications.
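
To make the pipeline described above concrete, the sketch below shows a multi-branch 3D-CNN front-end whose per-frame features are fused and passed to a two-layer bidirectional GRU trained with connectionist temporal classification (CTC). This is an illustrative PyTorch sketch only: the branch designs, channel widths, kernel sizes, frame count, and vocabulary size are assumptions, not the configuration reported in the paper (which combines three different 3D CNNs, including a densely connected and a multi-layer feature fusion variant).

```python
# Minimal sketch of a 3D-CNN + BiGRU + CTC lipreading pipeline.
# All hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn


class FrontEnd3D(nn.Module):
    """One 3D-CNN branch operating on (batch, channels, time, height, width)."""

    def __init__(self, in_ch=3, out_ch=96):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_ch, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.BatchNorm3d(32), nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),          # pool space, keep time
            nn.Conv3d(32, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm3d(out_ch), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d((None, 1, 1)),           # collapse space, keep time
        )

    def forward(self, x):                                  # x: (B, C, T, H, W)
        f = self.features(x)                               # (B, out_ch, T, 1, 1)
        return f.squeeze(-1).squeeze(-1).transpose(1, 2)   # (B, T, out_ch)


class LipReader(nn.Module):
    """Three front-end branches, feature fusion, 2-layer BiGRU, CTC output head.

    The paper uses three *different* 3D CNNs; identical branches are used here
    purely to keep the sketch short."""

    def __init__(self, vocab_size=28):                     # e.g. 27 characters + CTC blank
        super().__init__()
        self.branches = nn.ModuleList([FrontEnd3D() for _ in range(3)])
        self.gru = nn.GRU(96 * 3, 256, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * 256, vocab_size)

    def forward(self, x):                                   # x: (B, 3, T, H, W)
        fused = torch.cat([b(x) for b in self.branches], dim=-1)  # (B, T, 288)
        seq, _ = self.gru(fused)
        return self.classifier(seq).log_softmax(-1)         # (B, T, vocab)


# One CTC training step on a random batch of 75-frame mouth crops (assumed shapes).
model = LipReader()
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
video = torch.randn(2, 3, 75, 64, 128)                      # (B, C, T, H, W)
targets = torch.randint(1, 28, (2, 30))                      # character indices, 0 = blank
log_probs = model(video).transpose(0, 1)                     # CTCLoss expects (T, B, vocab)
loss = ctc(log_probs, targets,
           input_lengths=torch.full((2,), 75),
           target_lengths=torch.full((2,), 30))
loss.backward()
```

CTC is what allows training on sentence-level video without frame-level alignments, which is why the whole stack can be optimised end to end. The character and word error rates cited above follow the standard edit-distance definitions (substitutions, insertions, and deletions relative to the reference transcript, normalised by reference length). A minimal reference implementation is given below; the GRID-style sentence is chosen only to illustrate a visually ambiguous "at"/"eight" confusion, and this is not the authors' evaluation script.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


print(round(wer("bin blue at f two now", "bin blue eight f two now"), 3))  # 0.167
print(round(cer("bin blue at f two now", "bin blue eight f two now"), 3))  # 0.19
```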

Highlights

  • Speech is the most common form of communication between humans and involves the perception of both acoustic and visual information

  • To address the challenges of similar pronunciation and insufficient visual information, this paper presents a novel lipreading architecture that exhibits superior performance compared to that of traditional and existing deep learning visual speech recognition (VSR) systems

  • We developed a novel lipreading architecture based on end-to-end neural networks that relies exclusively on visual information

  • We compared the architecture of our proposed model with that of LipNet as the baseline and those of 3D LeNet-5, 3D VGG-F, 3D ResNet-50, and 3D DenseNet-121 to evaluate the reliability of our model for practical applications

  • We demonstrated improved accuracy and efficiency of the proposed architecture over existing deep learning architectures applied to VSR system implementation

Summary

Introduction
Related Work
Deep Learning VSR
Architecture
Spatial-Temporal
Densely Connected 3D CNN
Connectionist Temporal Classification
Dataset
Data Pre-Processing and Augmentation
Implementation
Performance Evaluation Metrics
Training Process and Learning Loss
Training Loss and Validation Loss of Overlapped Speakers
WER and CER
Method
Model and Computational Efficiency
Confusion Matrix
Conclusions