Sla-former: conformer using shifted linear attention for audio-visual speech recognition

Yewei Xiao, Jian Huang, Xuanming Liu, Aosu Zhu

doi:10.1007/s40747-024-01451-x

Abstract

Conformer-based models have proven highly effective in audio-visual speech recognition, where combining auditory and visual inputs significantly improves recognition accuracy. However, the softmax attention mechanism widely used in conformer models scales poorly: its time and space complexity grow quadratically with sequence length. To address this, this paper introduces the Shifted Linear Attention Conformer, a variant of the conformer architecture that adopts shifted linear attention as a scalable alternative to softmax attention. We analyze the factors that constrain the efficiency of linear attention. To mitigate them, we propose a simple yet effective mapping function and an efficient rank restoration module, which improve the effectiveness of self-attention while keeping computational complexity low. Furthermore, we integrate an attention-shifting technique that moves tokens within the attention mechanism, improving information flow across groups. This three-part approach is particularly beneficial for processing longer sequences. Our model achieves word error rates of 1.9% and 1.5% on the Lip Reading Sentences 2 and Lip Reading Sentences 3 datasets, respectively, which is state-of-the-art performance in audio-visual speech recognition.
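
The complexity argument in the abstract can be illustrated with a minimal NumPy sketch of generic kernel-based linear attention. It is not the paper's implementation: the feature map `elu(x) + 1`, the function names, and the tensor shapes are all assumptions chosen for illustration, and the paper's own mapping function, rank restoration module, and shifting step are not reproduced here. The sketch only shows why reordering the matrix products avoids the n × n attention matrix.

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1, a commonly used positive feature map.
    # Assumption for illustration; NOT the mapping function proposed in the paper.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def softmax_attention(q, k, v):
    # Standard attention: builds an (n, n) score matrix, so cost grows as O(n^2).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def linear_attention(q, k, v):
    # Kernelized attention: compute phi(K)^T V first, a (d, d_v) matrix,
    # so cost grows linearly in the sequence length n.
    q, k = feature_map(q), feature_map(k)
    kv = k.T @ v                  # (d, d_v) summary of keys and values
    z = k.sum(axis=0)             # (d,) normalizer terms
    return (q @ kv) / (q @ z)[:, None]

def linear_attention_quadratic(q, k, v):
    # Same quantity computed the slow way, via the explicit (n, n) matrix.
    q, k = feature_map(q), feature_map(k)
    weights = q @ k.T
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
n, d = 50, 8                      # hypothetical sequence length and head dimension
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))

fast = linear_attention(q, k, v)
slow = linear_attention_quadratic(q, k, v)
print(np.allclose(fast, slow))    # both orderings give the same output
```

The two linear-attention functions return the same values because matrix multiplication is associative; only the order of evaluation, and therefore the cost, differs. The output generally differs from `softmax_attention`, which is the kind of accuracy gap that the abstract's mapping function and rank restoration module are meant to narrow.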

Journal: Complex & Intelligent Systems
Publication Date: May 18, 2024
License type: CC BY 4.0

