Abstract
RGB-D saliency detection integrates information from both RGB images and depth maps to improve the prediction of salient regions under challenging conditions. The key to RGB-D saliency detection is to fully mine and fuse information at multiple scales across the two modalities. Previous approaches tend to apply the multi-scale and multi-modal fusion separately via local operations, which fails to capture long-range dependencies. Here we propose a transformer-based structure to address this issue. The proposed architecture is composed of two modules: an Intra-modality Feature Enhancement Module (IFEM) and an Inter-modality Feature Fusion Module (IFFM). IFFM conducts a sufficient feature fusion by integrating features from multiple scales and two modalities over all positions simultaneously. IFEM enhances feature on each scale by selecting and integrating complementary information from other scales within the same modality before IFFM. We show that transformer is a uniform operation which presents great efficacy in both feature fusion and feature enhancement, and simplifies the model design. Extensive experimental results on five benchmark datasets demonstrate that our proposed network performs favorably against most state-of-the-art RGB-D saliency detection methods. Furthermore, our model is efficient for having relatively smaller FLOPs and model size compared with other methods.
Published Version
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have