Visual Speech Recognition Research Articles

This single-institution prospective study, conducted at a tertiary referral center, compared speech perception (SP) in noise for normal-hearing (NH) individuals and individuals with hearing loss (IWHL) and assessed improvements in SP with use of a visual speech recognition program (VSRP). Eleven NH and 9 IWHL participants were tested in a sound-isolated booth facing a speaker through a window. In non-VSRP conditions, SP was evaluated on 40 Bamford-Kowal-Bench speech-in-noise test (BKB-SIN) sentences presented by the speaker at 50 A-weighted decibels (dBA), with multiperson babble noise presented at levels from 50 to 75 dBA. SP was defined as the percentage of words correctly identified. In VSRP conditions, an infrared camera tracked 35 points around the speaker's lips in real time during speech. Lip movement data were translated into text by a neural network-based VSRP developed in-house. SP was evaluated as in the non-VSRP condition, on 42 BKB-SIN sentences, with the addition of the VSRP output presented on a screen to the listener. In high-noise conditions (70-75 dBA) without VSRP, NH listeners achieved significantly higher SP than IWHL listeners (38.7% vs 25.0%, P = .02). NH listeners were significantly more accurate with VSRP than without it (75.5% vs 38.7%, P < .0001), as were IWHL listeners (70.4% vs 25.0%, P < .0001). With VSRP, no significant difference in SP was observed between NH and IWHL listeners (75.5% vs 70.4%, P = .15). The VSRP significantly increased SP in high-noise conditions for both NH and IWHL participants and eliminated the difference in SP accuracy between the two groups.

The performance of a Visual Speech Recognition (VSR) system is highly influenced by the selection of visual features, which are categorized into static and dynamic features. This work exploits both lip shape (static geometric features) and the temporal sequence of lip movements (dynamic motion features) to build a combined VSR system with fusion at both the feature level and the model level. The system is evaluated on a digit dataset against benchmark systems that use the Discrete Wavelet Transform (DWT), Discrete Cosine Transform (DCT), and Zernike Moments (ZM). First, the Motion History Image (MHI) is calculated for all visemes; wavelet and Zernike coefficients are then extracted from it and modeled using a simple GMM L-R HMM. The proposed method shows a significant improvement in performance, reaching 85% for MHI-DWT features, 74% for MHI-DCT, and 80% for MHI-ZM. Geometric features are extracted using an Active Shape Model (ASM). Two types of fusion are used: feature fusion and model fusion. In feature-level fusion, the motion features (MHI-DWT, MHI-DCT, and MHI-ZM) are combined with the geometric features (ASM) and modeled using GMM L-R HMM. Performance improves for the combined features, with an accuracy of 96.5% for DWT-ASM, 84% for DCT-ASM, and 93% for ZM-ASM. Model-level fusion is performed using a two-stream HMM with stream weights for the DWT-ASM, DCT-ASM, and ZM-ASM features. Weighted model-level fusion yields a further improvement, with an accuracy of 98.2% for DWT-ASM, 85% for DCT-ASM, and 94.5% for ZM-ASM. The proposed system achieves higher recognition accuracy than the benchmark systems (DWT, DCT, and ZM).
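The weighted model-level fusion described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each stream (motion and geometric) has already produced a log-likelihood per digit class, and it applies the stream weight to whole-utterance scores, whereas a two-stream HMM typically applies the weights inside the state emission probabilities. The function names, the weight parameter `motion_weight`, and the scores are all hypothetical.

```python
def fuse_stream_scores(motion_loglik, geometric_loglik, motion_weight):
    """Weighted sum of per-class log-likelihoods from two feature streams.

    motion_weight is the (assumed) stream weight for the motion stream;
    the geometric stream receives 1 - motion_weight.
    """
    if not 0.0 <= motion_weight <= 1.0:
        raise ValueError("motion_weight must lie in [0, 1]")
    if motion_loglik.keys() != geometric_loglik.keys():
        raise ValueError("both streams must score the same classes")
    geometric_weight = 1.0 - motion_weight
    return {
        label: motion_weight * motion_loglik[label]
        + geometric_weight * geometric_loglik[label]
        for label in motion_loglik
    }


def recognize(motion_loglik, geometric_loglik, motion_weight=0.5):
    """Return the class label with the highest fused score."""
    fused = fuse_stream_scores(motion_loglik, geometric_loglik, motion_weight)
    return max(fused, key=fused.get)


# Made-up log-likelihoods for three digit classes (illustration only).
motion = {"zero": -120.0, "one": -110.0, "two": -130.0}
geometric = {"zero": -90.0, "one": -105.0, "two": -95.0}

equal_weight_choice = recognize(motion, geometric, motion_weight=0.5)
motion_only_choice = recognize(motion, geometric, motion_weight=1.0)
```

With these made-up scores the motion stream alone prefers "one", while equal weighting lets the geometric stream pull the decision to "zero". In practice the stream weight would be tuned on held-out data.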

Articles published on Visual Speech Recognition

Improving the Recognition Performance of Lip Reading Using the Concatenated Three Sequence Keyframe Image Technique

Visual Speech Recognition using Convolutional Neural Network

Biosignal Sensors and Deep Learning-Based Speech Recognition: A Review.

Visual Speech Recognition with Lightweight Psychologically Motivated Gabor Features.

Audio–Visual Speech Recognition Based on Dual Cross-Modality Attentions with the Transformer Model

Effect of Various Visual Speech Units on Language Identification Using Visual Speech Recognition

A Survey on Bayesian Deep Learning

Guest Editorial Multimedia Computing With Interpretable Machine Learning

Appearance and shape-based hybrid visual feature extraction: toward audio–visual automatic speech recognition

Visual Speech Recognition: Improving Speech Perception in Noise through Artificial Intelligence.

Harnessing GANs for Zero-Shot Learning of New Classes in Visual Speech Recognition

The Geometrical Based Lip-Reading Techniques of Multi-Dimensional Dynamic Time Warping MDTW and Hidden Markov Models HMMs in the Audio Visual Speech Recognition

End-to-end visual speech recognition for small-scale datasets

Visual Speech Recognition using Fusion of Motion and Geometric Features

A Survey of Research on Lipreading Technology

Speaker-Independent Speech Recognition using Visual Features

Cross-Domain Deep Visual Feature Generation for Mandarin Audio–Visual Speech Recognition

Dorsal-movement and ventral-form regions are functionally connected during visual-speech recognition.

Visual Speech Recognition for Daily Indonesian Words Based on Combination of Double Difference and Image Projection Method

Dorsal face-movement and ventral face-form regions are functionally connected during visual-speech recognition
