AS-Net: active speaker detection using deep audio-visual attention

Abduljalil Radman, Jorma Laaksonen

doi:10.1007/s11042-024-18457-9

Abstract

Active Speaker Detection (ASD) aims to identify the active speaker among multiple speakers in a video scene. Previous ASD models often extract audio and visual features from long video clips with a complex 3D Convolutional Neural Network (CNN) architecture. Although models based on 3D CNNs can generate discriminative spatio-temporal features, they do so at the expense of computational complexity, and they frequently struggle to detect active speakers in short video clips. This work proposes the Active Speaker Network (AS-Net) model, a simple yet effective ASD method tailored for detecting active speakers in relatively short video clips without relying on 3D CNNs. Instead, it incorporates the Temporal Shift Module (TSM) into 2D CNNs, which enables the extraction of dense temporal visual features without additional computation. Moreover, self-attention and cross-attention schemes are introduced to enhance long-term temporal audio-visual synchronization, thereby improving ASD performance. Experimental results demonstrate that AS-Net outperforms state-of-the-art 2D CNN-based methods on the AVA-ActiveSpeaker dataset and remains competitive with methods that use more complex architectures.
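The temporal shift idea mentioned in the abstract can be illustrated with a minimal sketch. This is a generic, simplified illustration of shifting channels along the time axis, not the AS-Net implementation: the function name `temporal_shift`, the parameter `fold_div`, and the use of flat per-frame feature vectors (no spatial dimensions) are all assumptions made here for clarity. The point it shows is that the shift only moves existing values between neighbouring frames, so it adds no multiplications or learned parameters.

```python
def temporal_shift(frames, fold_div=8):
    """Shift a fraction of channels along the time axis, zero-padding the ends.

    frames:   list of T per-frame feature vectors, each a list of C floats.
    fold_div: hypothetical hyperparameter (an assumption of this sketch);
              C // fold_div channels are shifted in each direction and the
              remaining channels stay in place.
    """
    num_frames = len(frames)
    num_channels = len(frames[0])
    fold = num_channels // fold_div
    shifted = [[0.0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c in range(num_channels):
            if c < fold:
                src = t + 1   # first group: take the value from the next frame
            elif c < 2 * fold:
                src = t - 1   # second group: take the value from the previous frame
            else:
                src = t       # remaining channels: unchanged
            if 0 <= src < num_frames:
                shifted[t][c] = frames[src][c]
            # otherwise the position keeps its zero padding
    return shifted


# Toy clip: 3 frames, 4 channels; the value 10*t + c encodes (frame, channel).
clip = [[float(10 * t + c) for c in range(4)] for t in range(3)]
shifted_clip = temporal_shift(clip, fold_div=4)
for row in shifted_clip:
    print(row)
```

After the shift, each frame's feature vector mixes information from its temporal neighbours, so a subsequent 2D convolution applied frame by frame can respond to motion across frames.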


Journal: Multimedia Tools and Applications
Publication Date: Feb 5, 2024
License type: CC BY 4.0

Similar Papers

Active Speakers in Context
Juan Leon Alcazar ... Pablo Arbelaez
01 Jun 2020

Audio-video fusion strategies for active speaker detection in meetings
Lionel Pibre ... Isabelle Ferrané
Multimedia Tools and Applications | VOL. 82
28 Sep 2022

ASD-Transformer: Efficient Active Speaker Detection Using Self And Multimodal Transformers
Gourav Datta ... Tyler Etchart
23 May 2022

Learning deep features to recognise speech emotion using merged deep CNN
Jianfeng Zhao ... Lijiang Chen
IET Signal Processing | VOL. 12
01 Aug 2018
