Deep learning methods for salient object detection (SOD) have been studied actively and have achieved promising results. However, existing methods mainly focus on the decoding process and ignore the differences in the contributions of different encoder blocks. To address this problem, we propose an adaptive multi-content complementary network (PASNet) for salient object detection that aims to fully exploit the valuable contextual information in the encoder. Unlike existing CNN-based methods, we adopt the pyramid vision transformer (PVTv2) as the backbone network to learn global and local representations with its self-attention mechanism. We then follow a coarse-to-fine strategy and introduce two novel modules: an advanced semantic fusion module (ASFM) and a self-refinement module (SRM). The ASFM takes local and adjacent branches as inputs and collects the semantic and location information of salient objects from high-level features to generate an initial coarse saliency map. This coarse saliency map serves as location guidance for the low-level features, and the SRM is applied to capture the detailed information hidden in the low-level features. We propagate the location information, enriched with high-level semantics, from top to bottom across the salient region and fuse it with the detailed information through feature modulation. The model effectively suppresses noise in the features and significantly improves their expressive capability. To verify the effectiveness of PASNet, we conducted extensive experiments on five challenging datasets; the results show that the proposed model outperforms several state-of-the-art methods under different evaluation metrics.
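To make the coarse-to-fine pipeline described above more concrete, the following PyTorch sketch illustrates how an ASFM-like fusion of adjacent high-level branches could yield a coarse saliency map that then guides an SRM-like refinement of low-level features. This is a minimal sketch under assumed designs: the internal structure of both modules, the channel counts, and all helper names are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ASFM(nn.Module):
    """Hypothetical sketch of an advanced-semantic-fusion-style block:
    fuses a local high-level branch with its adjacent (deeper) branch."""

    def __init__(self, ch: int):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(ch * 2, ch, 3, padding=1),
            nn.BatchNorm2d(ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, local_feat, adjacent_feat):
        # Upsample the deeper branch to the local branch's resolution, then fuse.
        adjacent_feat = F.interpolate(
            adjacent_feat, size=local_feat.shape[2:], mode="bilinear", align_corners=False
        )
        return self.fuse(torch.cat([local_feat, adjacent_feat], dim=1))


class SRM(nn.Module):
    """Hypothetical sketch of a self-refinement-style block: modulates
    low-level detail features with the coarse saliency map as location guidance."""

    def __init__(self, ch: int):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1),
            nn.BatchNorm2d(ch),
            nn.ReLU(inplace=True),
        )
        self.predict = nn.Conv2d(ch, 1, 1)

    def forward(self, low_feat, coarse_map):
        # Expand the coarse location cue to the low-level resolution and
        # use it to gate (modulate) the detail features.
        guide = torch.sigmoid(
            F.interpolate(coarse_map, size=low_feat.shape[2:], mode="bilinear", align_corners=False)
        )
        modulated = self.refine(low_feat * guide + low_feat)
        return self.predict(modulated)


if __name__ == "__main__":
    # Stand-ins for backbone stage outputs (channel counts and sizes are illustrative,
    # not PVTv2's actual configuration).
    high4 = torch.randn(1, 64, 11, 11)   # deepest high-level stage
    high3 = torch.randn(1, 64, 22, 22)   # adjacent high-level stage
    low1 = torch.randn(1, 64, 88, 88)    # low-level detail stage

    asfm, srm = ASFM(64), SRM(64)
    head = nn.Conv2d(64, 1, 1)           # assumed 1x1 prediction head

    coarse_feat = asfm(high3, high4)     # fuse adjacent high-level branches
    coarse_map = head(coarse_feat)       # initial coarse saliency map
    fine_map = srm(low1, coarse_map)     # refine with low-level details
    print(coarse_map.shape, fine_map.shape)
```

In this sketch the coarse map acts purely as a multiplicative gate on the low-level features; the paper's feature modulation and top-down expansion of location information may be realized differently.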