Audio-visual saliency prediction with multisensory perception and integration

Jiawei Xie, Zhi Liu, Gongyang Li, Yingjie Song

doi:10.1016/j.imavis.2024.104955

Abstract

Audio-visual saliency prediction (AVSP) aims to model human attention patterns in the perception of auditory and visual scenes. Because perceiving and combining multi-modal saliency features from videos is challenging, this paper presents a multi-sensory framework for AVSP. The framework extracts audio, motion and image saliency features and integrates them effectively, and it can serve as a general architecture for the AVSP task. To obtain multi-sensory information, we develop a three-stream encoder with one stream each for audio, motion and image saliency features. For image saliency, we use a pre-trained encoder with knowledge related to image saliency to extract saliency features for each frame; these features are then combined with motion features by a spatial attention module. For motion features, 3D convolutional neural networks (CNNs) such as S3D are commonly used in AVSP models. However, these networks cannot effectively capture the global motion relationships in videos. To tackle this problem, we incorporate Transformer- and MLP-based motion encoders into the AVSP models. To learn joint audio-visual representations, an audio-visual fusion block enhances the correlation between audio and visual motion features under the supervision of a cosine similarity loss in a self-supervised manner. Finally, a multi-stage decoder integrates audio, motion and image saliency features to generate the final saliency map. We evaluate our method on six audio-visual eye-tracking datasets. Experimental results demonstrate that our method achieves compelling performance compared with state-of-the-art methods. The source code is available at https://github.com/oraclefina/MSPI.
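The abstract mentions a cosine similarity loss that supervises the correlation between audio and visual motion features, but it does not give the exact formulation. The following is a minimal sketch under the assumption that the loss is one minus the cosine similarity of paired feature vectors, averaged over a batch. The function names `cosine_similarity` and `cosine_similarity_loss`, the flat-vector feature layout and the `eps` guard are all illustrative choices, not taken from the paper or its repository.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], v: Sequence[float], eps: float = 1e-8) -> float:
    """Cosine of the angle between two equal-length feature vectors."""
    if len(a) != len(v):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, v))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_v = math.sqrt(sum(y * y for y in v))
    # eps guards against division by zero for all-zero vectors (assumption).
    return dot / max(norm_a * norm_v, eps)


def cosine_similarity_loss(
    audio_batch: Sequence[Sequence[float]],
    visual_batch: Sequence[Sequence[float]],
) -> float:
    """Mean of (1 - cosine similarity) over paired audio/visual features.

    Assumed form: the loss is near 0 when paired features point in the
    same direction and grows toward 2 as they point in opposite directions.
    """
    if not audio_batch or len(audio_batch) != len(visual_batch):
        raise ValueError("batches must be non-empty and of equal size")
    losses = [
        1.0 - cosine_similarity(a, v)
        for a, v in zip(audio_batch, visual_batch)
    ]
    return sum(losses) / len(losses)


# Hypothetical toy features: the first pair is aligned, the second orthogonal.
audio = [[1.0, 2.0, 3.0], [1.0, 0.0, 0.0]]
visual = [[2.0, 4.0, 6.0], [0.0, 1.0, 0.0]]
print(cosine_similarity_loss(audio, visual))
```

Minimizing a loss of this shape pulls each audio feature toward the direction of its paired visual motion feature without constraining their magnitudes, which is one plausible reading of "enhancing the correlation" between the two modalities. No labels are needed because the pairing comes from the video itself, consistent with the self-supervised description.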

Journal: Image and Vision Computing
Publication Date: Feb 23, 2024
Citations: 1

Similar Papers

A Combined Motion-Audio School Bullying Detection Algorithm
Liang Ye ... Hany Ferdinando
International Journal of Pattern Recognition and Artificial Intelligence | VOL. 32
27 Aug 2018

Stacking ensemble learning models for daily runoff prediction using 1D and 2D CNNs
Yutong Xie ... Xingyou Pan
Expert Systems with Applications | VOL. 217
24 Dec 2022

Multi-Scale correlation module for video-based facial expression recognition in the wild
Tankun Li ... Tardi Tjahjadi
Pattern Recognition | VOL. 142
13 May 2023

Classification of bird species from video using appearance and motion features
John Atanbori ... Patrick Dickinson
Ecological Informatics | VOL. 48
18 Jul 2018
