Abstract

Current RGB-D salient object detection (RGB-D SOD) methods mainly develop a generalizable model trained with binary cross-entropy (BCE) loss on convolutional or Transformer backbones. However, they usually rely on convolutional modules to fuse multi-modality features and pay little attention to capturing long-range multi-modality interactions during feature fusion. Furthermore, BCE loss does not explicitly explore intra- and inter-pixel relationships in a joint embedding space. To address these issues, we propose a cross-modality interaction parallel-transformer (CIPT) module, which better captures long-range multi-modality interactions and generates more comprehensive fusion features. In addition, we propose a pixel-level contrastive learning (PCL) method that improves inter-pixel discrimination and intra-pixel compactness, resulting in a well-structured embedding space and a better saliency detector. Specifically, we propose an asymmetric network (TPCL) for RGB-D SOD, which consists of a Swin V2 Transformer-based backbone and a specially designed lightweight backbone (LDNet). Moreover, an edge-guided module and a feature enhancement (FE) module are proposed to refine the learned fusion features. Extensive experiments demonstrate that our method performs favorably against 15 state-of-the-art methods on seven public datasets. We expect our work to facilitate the exploration of Transformers and contrastive learning for RGB-D SOD tasks. Our code and predicted saliency maps will be released at https://github.com/TomorrowJW/TPCL_RGBDSOD.
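To make the pixel-level contrastive learning idea concrete, the sketch below shows a generic supervised, InfoNCE-style contrastive loss computed over pixel embeddings and a binary saliency mask. It is only an illustrative assumption of how such a loss can be formed: the function name, the random pixel sampling, and the temperature and sample-count parameters are hypothetical and are not taken from the paper's exact PCL formulation.

```python
# A minimal sketch of a generic pixel-level contrastive loss (not the paper's exact PCL).
# Same-class pixels are pulled together, different-class pixels are pushed apart.
import torch
import torch.nn.functional as F

def pixel_contrastive_loss(embeddings, labels, num_samples=256, temperature=0.1):
    """embeddings: (B, C, H, W) pixel-wise features; labels: (B, 1, H, W) binary saliency mask."""
    B, C, H, W = embeddings.shape
    feats = embeddings.permute(0, 2, 3, 1).reshape(-1, C)   # flatten to (B*H*W, C)
    lbls = (labels.reshape(-1) > 0.5).long()                 # per-pixel class: salient vs. background

    # Randomly sample a subset of pixels so the pairwise similarity matrix stays small.
    idx = torch.randperm(feats.size(0), device=feats.device)[:num_samples]
    feats, lbls = F.normalize(feats[idx], dim=1), lbls[idx]

    sim = feats @ feats.t() / temperature                    # pairwise cosine similarities
    pos_mask = (lbls.unsqueeze(0) == lbls.unsqueeze(1)).float()
    self_mask = torch.eye(len(lbls), device=feats.device)
    pos_mask = pos_mask - self_mask                          # exclude self-pairs from positives

    # Log-softmax over all other pixels, averaged over each anchor's positive pairs.
    logits = sim - 1e9 * self_mask
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos_count = pos_mask.sum(1).clamp(min=1)
    loss = -(pos_mask * log_prob).sum(1) / pos_count
    return loss.mean()
```

In practice such a term would be added to the BCE loss, encouraging embeddings of salient pixels to form a compact cluster that is well separated from background pixels.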
