Abstract

Predicting salient regions in images requires capturing contextual information in the scene. Conventional saliency models typically rely on an encoder-decoder architecture and multi-scale feature fusion to model contextual features, which, however, incur a huge computational cost and a large number of model parameters. In this paper, we address the saliency prediction task by capturing long-range dependencies with the self-attention mechanism. Self-attention has been widely used in image recognition and other classification tasks, but is still rarely considered in the regression-based saliency prediction task. Inspired by the non-local block, we propose a new SALiency prediction network in which a deep convolutional network is integrated with the attention mechanism, named SalDA. Considering that each feature map may capture different salient regions, our spatial attention module first adaptively aggregates the feature at each position as a weighted sum of the features at all positions within each independent channel. Meanwhile, to capture the interdependencies between channels, we also introduce a channel attention module that integrates features across different channels. We combine these two attention modules into a multi-attention module to further improve the predicted saliency maps. We demonstrate the effectiveness of SalDA on the largest saliency prediction dataset, SALICON. Compared to other state-of-the-art methods in this area, SalDA yields comparable saliency prediction performance with substantially fewer model parameters and shorter inference time.
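For readers who want a concrete picture of what such spatial and channel attention modules compute, the following PyTorch sketch shows one plausible implementation. It is not the authors' released code: the 1x1-convolution projections, the learnable residual weights (`gamma`), and the simple summation of the two branches are assumptions patterned on common non-local and dual-attention designs.

```python
# Minimal sketch of spatial + channel attention, assuming non-local /
# dual-attention-style modules; details are illustrative, not from the paper.
import torch
import torch.nn as nn


class SpatialAttention(nn.Module):
    """Aggregates each position as a weighted sum of all positions."""

    def __init__(self, channels: int):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.key = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learnable residual weight

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        n = h * w
        q = self.query(x).view(b, -1, n).permute(0, 2, 1)   # (b, n, c//8)
        k = self.key(x).view(b, -1, n)                       # (b, c//8, n)
        attn = torch.softmax(torch.bmm(q, k), dim=-1)        # (b, n, n)
        v = self.value(x).view(b, -1, n)                      # (b, c, n)
        out = torch.bmm(v, attn.permute(0, 2, 1)).view(b, c, h, w)
        return self.gamma * out + x


class ChannelAttention(nn.Module):
    """Re-weights each channel using inter-channel similarity."""

    def __init__(self):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        flat = x.view(b, c, -1)                               # (b, c, n)
        energy = torch.bmm(flat, flat.permute(0, 2, 1))       # (b, c, c)
        attn = torch.softmax(energy, dim=-1)
        out = torch.bmm(attn, flat).view(b, c, h, w)
        return self.gamma * out + x


class MultiAttention(nn.Module):
    """Combines the two branches; summation is one possible fusion choice."""

    def __init__(self, channels: int):
        super().__init__()
        self.spatial = SpatialAttention(channels)
        self.channel = ChannelAttention()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.spatial(x) + self.channel(x)


if __name__ == "__main__":
    features = torch.randn(2, 64, 32, 32)       # e.g. a backbone feature map
    print(MultiAttention(64)(features).shape)    # torch.Size([2, 64, 32, 32])
```

Because both branches end in a residual connection scaled by a zero-initialized weight, the module starts as an identity mapping and gradually learns how much long-range context to mix in.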
