Abstract
This study adopts the Branchformer architecture for automatic speech recognition in place of the widely used Conformer, offering a new approach to audio-visual speech recognition (AVSR). Building on Branchformer, we propose the Relational-Branchformer (R-Branchformer). A convolutional attention relation module strengthens the connection between the local and global branches by modeling their interrelations and interactions, allowing local and global contextual information to be embedded into each other and substantially improving model performance. The model is trained with the connectionist temporal classification (CTC) loss, with intermediate CTC losses inserted between blocks. In addition, a gated interlayer collaboration module, adapted and enhanced from prior work, replaces the intermediate CTC module and relaxes the conditional independence assumption inherent to CTC, further improving performance. We also propose an audio-visual output enhancement module that fuses information from the audio and visual modalities to enrich the audio-visual representation. R-Branchformer achieves word error rates of 1.7% and 1.5% on the LRS2 and LRS3 test sets, respectively, demonstrating state-of-the-art performance on audio-visual speech recognition.
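To make the two-branch design concrete, the sketch below shows a minimal Branchformer-style block with a global self-attention branch, a local depthwise-convolution branch, and an illustrative cross-branch "relation" step that lets each branch condition on the other before merging. This is an assumption-laden sketch, not the authors' implementation: the class name RelationalBranchformerBlock, the cross-attention-based relation step, and the concatenate-and-project merge are all placeholders standing in for the paper's convolutional attention relation module.

```python
import torch
import torch.nn as nn


class RelationalBranchformerBlock(nn.Module):
    """Illustrative two-branch block: a global attention branch, a local
    convolutional branch, and a hypothetical relation step that embeds each
    branch's context into the other before merging (all names and choices
    here are assumptions, not the paper's exact module)."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, kernel_size: int = 31):
        super().__init__()
        self.norm_global = nn.LayerNorm(d_model)
        self.norm_local = nn.LayerNorm(d_model)
        # Global branch: multi-head self-attention over the whole sequence.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Local branch: depthwise convolution over time.
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size // 2, groups=d_model)
        # Hypothetical relation step: cross-attention between the branches.
        self.rel_g2l = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.rel_l2g = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.merge = nn.Linear(2 * d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        g = self.norm_global(x)
        g, _ = self.attn(g, g, g)                      # global context
        l = self.norm_local(x).transpose(1, 2)         # (batch, d_model, time)
        l = self.conv(l).transpose(1, 2)               # local context
        # Embed each branch's context into the other (illustrative fusion).
        g_rel, _ = self.rel_g2l(g, l, l)               # global queries local
        l_rel, _ = self.rel_l2g(l, g, g)               # local queries global
        merged = self.merge(torch.cat([g + g_rel, l + l_rel], dim=-1))
        return x + merged


# Usage sketch: a batch of 8 utterances, 100 frames, 256-dim features.
block = RelationalBranchformerBlock()
y = block(torch.randn(8, 100, 256))
print(y.shape)  # torch.Size([8, 100, 256])
```

In a full AVSR model of the kind described above, stacks of such blocks would encode the audio and visual streams, with intermediate CTC (or a gated interlayer collaboration step) applied between blocks and the final fused representation decoded with CTC.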