Abstract

Video saliency prediction (VSP) aims to imitate human eye fixations on video frames. However, the potential of this task has not been fully exploited, since existing VSP methods only model the visual saliency of the observed past frames. In this paper, we present the first attempt to extend this task to video saliency forecasting (VSF), which predicts the attention regions of consecutive future frames. To tackle this problem, we propose a video saliency forecasting transformer (VSFT) network built on a new encoder-decoder architecture. Unlike existing VSP methods, VSFT is the first pure-transformer architecture in the VSP field and does not depend on a pretrained S3D model. In VSFT, the attention mechanism is exploited to capture spatio-temporal dependencies between the observed past frames and the target future frame. We propose cross-attention guidance blocks (CAGBs) to aggregate multi-level representation features and provide sufficient guidance for forecasting. We conduct comprehensive experiments on two benchmark datasets, DHF1K and Hollywood-2, and investigate the forecasting and prediction abilities of existing VSP methods by modifying their supervision signals. Experimental results demonstrate that our method achieves superior performance on both the VSF and VSP tasks.
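The abstract names the cross-attention guidance mechanism but not its internals. Purely as an illustration, the following is a minimal PyTorch sketch of what a cross-attention guidance block could look like, assuming a standard transformer-decoder-style design in which future-frame query tokens attend to past-frame encoder tokens. The class name, feature dimensions, per-level block arrangement, and layer choices are hypothetical assumptions, not the paper's actual architecture.

```python
# Illustrative sketch only: a cross-attention "guidance" block in the spirit of
# the CAGB described above. All shapes and layer choices are assumptions.
import torch
import torch.nn as nn


class CrossAttentionGuidanceBlock(nn.Module):
    """Lets future-frame query tokens attend to one level of past-frame features."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim)
        )
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, query_tokens: torch.Tensor, memory_tokens: torch.Tensor) -> torch.Tensor:
        # query_tokens:  (B, N_q, dim) decoder tokens for the target future frame
        # memory_tokens: (B, N_m, dim) encoder tokens from the observed past frames
        attended, _ = self.cross_attn(query_tokens, memory_tokens, memory_tokens)
        x = self.norm1(query_tokens + attended)
        return self.norm2(x + self.ffn(x))


# Aggregating multi-level encoder features: one block per level, applied in sequence.
blocks = nn.ModuleList([CrossAttentionGuidanceBlock() for _ in range(3)])
query = torch.randn(2, 196, 256)                       # hypothetical future-frame queries
levels = [torch.randn(2, 784, 256) for _ in range(3)]  # hypothetical multi-level features
for blk, mem in zip(blocks, levels):
    query = blk(query, mem)
```

The point this sketch illustrates is the guidance flow: the future-frame queries gather information from the past-frame representations level by level, so each encoder level contributes cues to the forecast of the future saliency map.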
