Abstract

This paper studies audio-visual salient object detection. Salient object detection aims to detect and mark the objects in a visual scene that most attract human attention. Traditionally, visual salient object detection relies only on images or video frames, without modeling the multi-modal nature of human perception, which includes the interaction between vision and hearing. To improve visual salient object detection, we therefore incorporate the audio modality into the traditional task by applying a two-stream audio-visual deep learning network. To this end, we also build an audio-visual salient object detection dataset, called AVSOD, based on an existing dataset. To verify the effectiveness of the audio modality for salient object detection, we compare the performance of the deep learning model with and without audio input. The experimental results demonstrate that the audio modality is an effective complement to visual salient object detection and also verify the usefulness of the proposed dataset.
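The two-stream design mentioned above can be sketched in miniature as follows. This is an illustrative assumption, not the paper's actual architecture: the toy encoders (global average pooling for vision, RMS energy for audio), the feature size `D`, and the late-fusion scheme are all placeholders standing in for trained networks.

```python
import numpy as np

rng = np.random.default_rng(0)

T, H, W = 4, 32, 32   # number of video frames and frame resolution
S = 1600              # audio samples aligned with each frame
D = 64                # per-stream feature dimension (assumed)

# Random matrices standing in for trained encoder/decoder weights.
Wv = rng.standard_normal((3, D)) * 0.1       # visual projection
Wa = rng.standard_normal((1, D)) * 0.1       # audio projection
Wf = rng.standard_normal((2 * D, H * W)) * 0.1  # fusion-to-saliency head

def visual_stream(frames):
    """Toy visual encoder: per-frame global average pool + linear projection."""
    pooled = frames.mean(axis=(1, 2))   # (T, H, W, 3) -> (T, 3)
    return pooled @ Wv                  # (T, D)

def audio_stream(waveform):
    """Toy audio encoder: per-frame RMS energy + linear projection."""
    energy = np.sqrt((waveform ** 2).mean(axis=1, keepdims=True))  # (T, 1)
    return energy @ Wa                  # (T, D)

def fuse_and_predict(fv, fa):
    """Late fusion: concatenate the two streams, map to a coarse saliency map."""
    fused = np.concatenate([fv, fa], axis=1)          # (T, 2D)
    logits = fused @ Wf                               # (T, H*W)
    return (1.0 / (1.0 + np.exp(-logits))).reshape(T, H, W)

frames = rng.standard_normal((T, H, W, 3))
audio = rng.standard_normal((T, S))
saliency = fuse_and_predict(visual_stream(frames), audio_stream(audio))
print(saliency.shape)  # one saliency map per frame: (4, 32, 32)
```

Dropping the `fa` branch (or zeroing it out) gives the vision-only baseline, which is the kind of with/without-audio comparison the abstract describes.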
