Abstract

Neuroscience studies have verified that synchronized audiovisual stimuli evoke a stronger visual-perception response than either stimulus alone, and many studies show that audio signals affect human gaze behavior when viewing natural video scenes. In this paper, we therefore propose a multi-sensory framework that combines audio and visual signals for video saliency prediction. It consists of four modules: auditory feature extraction, visual feature extraction, semantic interaction between the auditory and visual features, and feature fusion. Taking audio and visual signals as inputs, we present a deep learning network architecture that carries out the tasks of these four modules. It is an end-to-end architecture that allows the semantics learned from the audio and visual stimuli to interact. Numerical and visual results show that our method achieves a significant improvement over eleven recent saliency models that ignore audio stimuli, even though some of them are state-of-the-art deep learning models.
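The abstract does not detail the network itself; the following is a minimal PyTorch sketch of how the four modules could be wired together, assuming a 3D-convolutional visual branch over video clips, a 2D-convolutional branch over audio spectrograms, a gating-style semantic interaction, and convolutional fusion into a saliency map. The class name AudioVisualSaliencyNet, all layer choices, and the gating mechanism are illustrative assumptions, not the authors' actual architecture.

```python
# Hypothetical sketch of the four-module pipeline described in the abstract:
# auditory feature extraction, visual feature extraction, semantic interaction,
# and feature fusion. Layer sizes and the gating-style interaction are
# illustrative assumptions only.
import torch
import torch.nn as nn


class AudioVisualSaliencyNet(nn.Module):
    def __init__(self, feat_dim=64):
        super().__init__()
        # Module 1: auditory feature extraction from a log-mel spectrogram
        # (batch, 1, mel_bins, time) -> (batch, feat_dim)
        self.audio_branch = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        # Module 2: visual feature extraction from a video clip
        # (batch, 3, frames, H, W) -> (batch, feat_dim, 1, H/2, W/2)
        self.visual_branch = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(32, feat_dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d((1, None, None)),  # collapse the temporal axis
        )
        # Module 3: semantic interaction -- audio features modulate the visual
        # feature map channel-wise (one simple way to let modalities interact).
        self.audio_gate = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.Sigmoid())
        # Module 4: feature fusion and saliency map prediction
        self.fusion = nn.Sequential(
            nn.Conv2d(feat_dim, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, kernel_size=1), nn.Sigmoid(),
        )

    def forward(self, video, audio):
        a = self.audio_branch(audio)              # (B, feat_dim)
        v = self.visual_branch(video).squeeze(2)  # (B, feat_dim, H', W')
        gate = self.audio_gate(a).unsqueeze(-1).unsqueeze(-1)
        fused = v * gate                          # audio-modulated visual features
        return self.fusion(fused)                 # (B, 1, H', W') saliency map


if __name__ == "__main__":
    net = AudioVisualSaliencyNet()
    video = torch.randn(2, 3, 8, 64, 64)  # batch of 8-frame RGB clips
    audio = torch.randn(2, 1, 40, 100)    # matching log-mel spectrograms
    print(net(video, audio).shape)        # torch.Size([2, 1, 32, 32])
```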
