Abstract

Most cutting-edge video saliency prediction models rely on spatiotemporal features extracted by 3D convolutions, owing to their ability to capture local contextual cues. However, 3D convolutions cannot effectively model long-range spatiotemporal dependencies in videos. To address this limitation, we propose a novel Transformer-based Multi-scale Feature Integration Network (TMFI-Net) for video saliency prediction, which consists of a semantic-guided encoder and a hierarchical decoder. Firstly, starting from Transformer-based multi-level spatiotemporal features, the semantic-guided encoder enhances the feature at each level by injecting the highest-level feature through a top-down pathway and a longitudinal connection, endowing the multi-level spatiotemporal features with rich contextual information. In this way, the features are steered to focus on salient regions. Secondly, the hierarchical decoder employs a multi-dimensional attention (MA) module to refine features jointly along the channel, temporal, and spatial dimensions. Subsequently, it deploys a progressive decoding block to produce an initial saliency prediction, which provides a coarse localization of salient regions. Lastly, considering the complementarity of the different predictions, we integrate all initial saliency predictions into the final saliency map. Comprehensive experiments on four video saliency datasets demonstrate that our model outperforms state-of-the-art video saliency models. The code is available at https://github.com/wusonghe/TMFI-Net.
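The abstract does not detail the internal design of the MA module; as a rough illustration only, the PyTorch sketch below shows one plausible way to reweight a spatiotemporal feature map jointly along the channel, temporal, and spatial dimensions. The class name MultiDimensionalAttention, the CBAM-style pooling, and all layer sizes are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch only -- NOT the TMFI-Net code. It assumes the decoder
# features are 5-D tensors shaped (B, C, T, H, W) and uses CBAM-style pooling
# as a stand-in for the paper's multi-dimensional attention (MA) module.
import torch
import torch.nn as nn


class MultiDimensionalAttention(nn.Module):
    """Reweights a feature map along channel, temporal, and spatial axes."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # Channel attention: squeeze (T, H, W), then excite each channel.
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # Temporal attention: a small 1-D conv over the frame axis.
        self.temporal_conv = nn.Conv1d(1, 1, kernel_size=3, padding=1)
        # Spatial attention: a 3-D conv over channel-pooled avg/max maps.
        self.spatial_conv = nn.Conv3d(2, 1, kernel_size=(1, 7, 7),
                                      padding=(0, 3, 3))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, t, _, _ = x.shape

        # Channel attention weights from globally pooled descriptors.
        ca = self.channel_mlp(x.mean(dim=(2, 3, 4)))                 # (B, C)
        x = x * torch.sigmoid(ca).view(b, c, 1, 1, 1)

        # Temporal attention weights, one scalar per frame.
        ta = self.temporal_conv(x.mean(dim=(1, 3, 4)).unsqueeze(1))  # (B, 1, T)
        x = x * torch.sigmoid(ta).view(b, 1, t, 1, 1)

        # Spatial attention map shared across channels.
        sa = torch.cat([x.mean(dim=1, keepdim=True),
                        x.amax(dim=1, keepdim=True)], dim=1)         # (B, 2, T, H, W)
        return x * torch.sigmoid(self.spatial_conv(sa))


if __name__ == "__main__":
    ma = MultiDimensionalAttention(channels=192)
    feats = torch.randn(2, 192, 4, 28, 28)   # hypothetical encoder features
    print(ma(feats).shape)                   # torch.Size([2, 192, 4, 28, 28])
```

The sequential channel-then-temporal-then-spatial gating above is only one reading of "jointly along channel, temporal, and spatial dimensions"; the released code at the repository linked in the abstract defines the actual module.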
