Audio-visual Speech Recognition Research Articles

This study embraced the state-of-the-art Branchformer series architecture within the realm of automatic speech recognition, supplanting the widely utilized Conformer architecture. This substitution offers an innovative remedy tailored to audio-visual speech recognition tasks. Building upon the Branchformer architecture, enhancements were made, culminating in the proposal of the Relational-Branchformer (R-Branchformer). The convolutional attention relation module was innovatively incorporated to augment the connectivity between the local and global branches by meticulously considering their interrelations and interplays. Consequently, this module facilitates the mutual embedding of local and global contextual information, ultimately leading to a substantial enhancement in model performance. Our model was grounded in the utilization of the connectionist temporal classification (CTC) loss, wherein intermediate CTC losses were incorporated between blocks. Moreover, through the reference and enhancement of the gated interlayer collaboration module, which superseded the inter CTC module, the conditional independence assumption intrinsic to the CTC model was effectively relaxed. As a consequence, this augmentation markedly bolstered the overall performance of our model. Furthermore, the audio-visual output enhancement module was proposed, which adeptly assimilates information from both audio and visual modalities to enrich the representation of audio-visual information. Consequently, the R-Branchformer model achieved remarkable word error rates of 1.7% and 1.5% on the LRS2 and LRS3 test sets, respectively, exemplifying its state-of-the-art performance in audio-visual speech recognition tasks.

Due to the detrimental impact of noise on the conventional audio speech recognition (ASR) task, audio-visual speech recognition~(AVSR) has been proposed by incorporating both audio and visual video signals. Although existing methods have demonstrated that the aligned visual input of lip movements can enhance the robustness of AVSR systems against noise, the paired videos are not always available during inference, leading to the problem of the missing visual modality, which restricts their practicality in real-world scenarios. To tackle this problem, we propose a Discrete Feature based Visual Generative Model (DFVGM) which exploits semantic correspondences between the audio and visual modalities during training, generating visual hallucinations in lieu of real videos during inference. To achieve that, the primary challenge is to generate the visual hallucination given the noisy audio while preserving semantic correspondences with the clean speech. To tackle this challenge, we start with training the audio encoder in the Audio-Only (AO) setting, which generates continuous semantic features closely associated with the linguistic information. Simultaneously, the visual encoder is trained in the Visual-Only (VO) setting, producing visual features that are phonetically related. Next, we employ K-means to discretize the continuous audio and visual feature spaces. The discretization step allows DFVGM to capture high-level semantic structures that are more resilient to noise and generate visual hallucinations with high quality. To evaluate the effectiveness and robustness of our approach, we conduct extensive experiments on two publicly available datasets. The results demonstrate that our method achieves a remarkable 53% relative reduction (30.5%->12.9%) in Word Error Rate (WER) on average compared to the current state-of-the-art Audio-Only (AO) baselines while maintaining comparable results (< 5% difference) under the Audio-Visual (AV) setting even without video as input.

Audio-visual Speech Recognition Research Articles

Related Topics

Articles published on Audio-visual Speech Recognition

Audiovisual Speech Recognition Method Based on Connectionism

Relational-branchformer: Novel framework for audio-visual speech recognition

Sla-former: conformer using shifted linear attention for audio-visual speech recognition

Audio–visual speech recognition based on regulated transformer and spatio–temporal fusion strategy for driver assistive systems

Restoring Speaking Lips from Occlusion for Audio-Visual Speech Recognition

Visual Hallucination Elevates Speech Recognition

Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation

Retraction Note: Audio-Visual Automatic Speech Recognition Towards Education for Disabilities.

KMSAV: Korean multi‐speaker spontaneous audiovisual dataset

Multimodal audiovisual speech recognition architecture using a three‐feature multi‐fusion method for noise‐robust systems

Children's use of spatial and visual cues for release from perceptual masking.

Deep Learning-Based Audio-Visual Speech Recognition for Bosnian Digits

CATNet: Cross-modal fusion for audio–visual speech recognition

Kernel Probabilistic Dependent-Independent Canonical Correlation Analysis

Learn2Talk: 3D Talking Face Learns from 2D Talking Face.

VatLM: Visual-Audio-Text Pre-Training With Unified Masked Prediction for Speech Representation Learning

Exploring the Online Platform Construction of English Movie Audiovisual Speaking Blended Teaching Mode Based on POA Theory of Production-Oriented Approach

Multimodal Sparse Transformer Network for Audio-Visual Speech Recognition.

Usefulness of glottal excitation source information for audio-visual speech recognition system

Generative Adversarial Networks (GANs) for Audio-Visual Speech Recognition in Artificial Intelligence IoT

Lead the way for us

Editage

Paperpal

R Discovery

Mind the Graph

Audio-visual Speech Recognition Research Articles

Related Topics

Articles published on Audio-visual Speech Recognition

Audiovisual Speech Recognition Method Based on Connectionism

Relational-branchformer: Novel framework for audio-visual speech recognition

Sla-former: conformer using shifted linear attention for audio-visual speech recognition

Audio–visual speech recognition based on regulated transformer and spatio–temporal fusion strategy for driver assistive systems

Restoring Speaking Lips from Occlusion for Audio-Visual Speech Recognition

Visual Hallucination Elevates Speech Recognition

Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation

Retraction Note: Audio-Visual Automatic Speech Recognition Towards Education for Disabilities.

KMSAV: Korean multi‐speaker spontaneous audiovisual dataset

Multimodal audiovisual speech recognition architecture using a three‐feature multi‐fusion method for noise‐robust systems

Children's use of spatial and visual cues for release from perceptual masking.

Deep Learning-Based Audio-Visual Speech Recognition for Bosnian Digits

CATNet: Cross-modal fusion for audio–visual speech recognition

Kernel Probabilistic Dependent-Independent Canonical Correlation Analysis

Learn2Talk: 3D Talking Face Learns from 2D Talking Face.

VatLM: Visual-Audio-Text Pre-Training With Unified Masked Prediction for Speech Representation Learning

Exploring the Online Platform Construction of English Movie Audiovisual Speaking Blended Teaching Mode Based on POA Theory of Production-Oriented Approach

Multimodal Sparse Transformer Network for Audio-Visual Speech Recognition.

Usefulness of glottal excitation source information for audio-visual speech recognition system

Generative Adversarial Networks (GANs) for Audio-Visual Speech Recognition in Artificial Intelligence IoT