Abstract

Multimodal salient object detection (SOD) combines images from different modalities to generate a saliency map of the most visually salient objects. When fusing multimodal and multiscale features, maintaining both the integrity and the fine granularity of the target is critical for improving multimodal SOD performance. The fine-grained information differences between modalities and the patch-level size of transformer features prevent most existing studies from guaranteeing both granularities. Therefore, we propose a patch-to-pixel attention-aware transformer network (PATNet) to overcome these problems, in which the integrity and fine-grained details of the saliency map are preserved by a decision-transformation strategy that maps global patches onto local pixels. Specifically, PATNet consists of a shared attention fusion module (SAFM), an adjacent modeling fusion module (AMFM), and a fine-grained mapping module (FMM). SAFM enhances the consistency between multimodal features through a shared attention matrix and an identical convolutional feed-forward network. Meanwhile, AMFM enhances low-resolution features by modeling neighboring features to avoid the aliasing effect of upsampling. In the output stage, FMM maps the patch-represented feature maps onto pixels and restores the details of salient objects. Extensive experiments demonstrate that PATNet outperforms 24 state-of-the-art methods on six RGB-D and three RGB-T datasets. The source code is publicly available at https://github.com/LitterMa-820/PATNet.
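The shared-attention idea behind SAFM can be illustrated with a minimal sketch: a single attention matrix reweights the token sequences of both modalities, and identical feed-forward weights project both outputs, which encourages cross-modal consistency. This is not the paper's implementation; the joint query/key source (the summed modalities), the linear feed-forward stand-in for the convolutional FFN, and all weight shapes are assumptions made for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def shared_attention_fusion(f_rgb, f_aux, d=64, seed=0):
    """Illustrative shared-attention fusion over two modalities.

    f_rgb, f_aux: (n_tokens, channels) feature matrices for the RGB and
    auxiliary (depth/thermal) modality. One attention matrix, built here
    from the summed modalities (an assumption), reweights BOTH token
    sequences; a single shared linear layer stands in for the identical
    feed-forward network applied to both.
    """
    rng = np.random.default_rng(seed)
    n, c = f_rgb.shape
    Wq = rng.standard_normal((c, d)) / np.sqrt(c)   # query projection
    Wk = rng.standard_normal((c, d)) / np.sqrt(c)   # key projection
    Wff = rng.standard_normal((c, c)) / np.sqrt(c)  # shared FFN weights

    fused = f_rgb + f_aux  # joint query/key source (illustrative choice)
    attn = softmax((fused @ Wq) @ (fused @ Wk).T / np.sqrt(d))
    out_rgb = (attn @ f_rgb) @ Wff  # same attention matrix and same
    out_aux = (attn @ f_aux) @ Wff  # FFN weights for both modalities
    return out_rgb, out_aux, attn
```

Because the attention matrix and feed-forward weights are shared rather than modality-specific, both output streams are reweighted by the same cross-token affinities, which is the consistency property the abstract attributes to SAFM.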

