Abstract

The human visual system effectively analyzes scenes based on local, global, and semantic properties. Deep learning-based saliency prediction models have adopted two-stream networks, leveraged prior knowledge of global semantics, or added long-range dependency modeling structures such as transformers to incorporate global saliency information. However, they either incur high complexity in learning local and global features or neglect the design of local feature enhancement. In this paper, we propose a Global Semantic-Guided Network (GSGNet), which first enriches global semantics through a modified transformer block and then efficiently incorporates semantic information into visual features from both local and global perspectives. Multi-head self-attention in transformers captures global features but lacks information communication within and between feature subspaces (heads) when computing the similarity matrix. To learn global representations and enhance interactions among these subspaces, we propose a Channel-Squeeze Spatial Attention (CSSA) module that emphasizes channel-relevant information in a compressed manner and learns global spatial relationships. To better fuse local and global contextual information, we propose a hybrid CNN-Transformer block, the local–global fusion block (LGFB), for aggregating semantic features simply and efficiently. Experimental results on four public datasets demonstrate that our model achieves compelling performance compared with state-of-the-art saliency prediction models across various evaluation metrics.
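
As an illustration only, the sketch below shows one plausible reading of the "channel-squeeze then spatial attention" idea described in the abstract: channel information is compressed into a single spatial map that reweights the input features. The module name, the reduction ratio, and the use of 1x1 convolutions are assumptions for illustration; the abstract does not specify the actual CSSA design, which may differ (e.g., in how global spatial relationships are computed).

```python
import torch
import torch.nn as nn


class ChannelSqueezeSpatialAttention(nn.Module):
    """Illustrative sketch (not the paper's implementation): squeeze the channel
    dimension into a single spatial attention map, then reweight the input."""

    def __init__(self, channels: int, reduction: int = 8):  # reduction ratio is an assumption
        super().__init__()
        # Channel squeeze: compress channel-relevant information with 1x1 convolutions.
        self.squeeze = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, 1, kernel_size=1),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map from a CNN or transformer stage.
        attn = self.sigmoid(self.squeeze(x))  # (B, 1, H, W) spatial attention map
        return x * attn                       # emphasize spatially salient locations


# Usage: refine a hypothetical 256-channel feature map.
feat = torch.randn(2, 256, 24, 32)
cssa = ChannelSqueezeSpatialAttention(channels=256)
out = cssa(feat)
print(out.shape)  # torch.Size([2, 256, 24, 32])
```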
