Abstract

Recently, Salient Object Detection (SOD) has achieved promising results thanks to the rapid evolution of CNN architectures. However, these CNN-based methods have limited capacity to model long-range interactions among pixels. In this paper, we propose to combine the merits of CNNs and Transformers and design a unified model for RGB and RGB-D SOD. First, we represent the image from a sequence-to-sequence perspective and use a Transformer-based branch to model long-range relationships among image tokens, obtaining global semantic information and predicting a coarse saliency map. Second, we employ a CNN-based branch to extract multi-scale local detail features and predict contours for auxiliary supervision at each level. Finally, we propose the Bi-enhancement Fusion Module to fuse multi-scale cues from the two branches and predict a more accurate saliency map. In addition, for RGB-D SOD, to obtain effective cross-modality features, we propose a Cross-modality Multi-Scale Transformer Module and a Depth-induced Enhancement Module to fuse RGB and depth cues in the Transformer branch and the CNN branch, respectively. Experiments on both RGB and RGB-D SOD datasets demonstrate that our proposed model achieves satisfactory performance compared with state-of-the-art methods.
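The following is a minimal PyTorch sketch of the dual-branch idea described above, intended only to illustrate the overall data flow: a Transformer branch for global context and a coarse saliency map, a CNN branch for local detail with an auxiliary contour head, and a fusion block merging the two. The abstract gives no implementation details, so all module names (`BiEnhancementFusion`, `DualBranchSOD`), layer choices, and hyperparameters here are hypothetical assumptions, not the authors' actual design.

```python
import torch
import torch.nn as nn


class BiEnhancementFusion(nn.Module):
    """Hypothetical fusion block: each branch gates (enhances) the other,
    then the two enhanced feature maps are merged."""

    def __init__(self, channels):
        super().__init__()
        self.gate_t = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.gate_c = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.merge = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, feat_t, feat_c):
        enhanced_c = feat_c * self.gate_t(feat_t) + feat_c          # global cues refine detail features
        enhanced_t = feat_t * self.gate_c(feat_c) + feat_t          # detail cues refine global features
        return self.merge(torch.cat([enhanced_t, enhanced_c], dim=1))


class DualBranchSOD(nn.Module):
    """Skeleton of the two-branch design: Transformer branch -> coarse saliency,
    CNN branch -> contour (auxiliary supervision), fused features -> final saliency."""

    def __init__(self, channels=64, num_heads=4, depth=2):
        super().__init__()
        # Transformer branch: patch embedding + standard Transformer encoder.
        self.patch_embed = nn.Conv2d(3, channels, kernel_size=16, stride=16)
        layer = nn.TransformerEncoderLayer(d_model=channels, nhead=num_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)
        # CNN branch: a small convolutional stack standing in for a multi-scale backbone.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.fusion = BiEnhancementFusion(channels)
        self.coarse_head = nn.Conv2d(channels, 1, 1)   # coarse saliency map
        self.contour_head = nn.Conv2d(channels, 1, 1)  # auxiliary contour prediction
        self.final_head = nn.Conv2d(channels, 1, 1)    # fused saliency map

    def forward(self, rgb):
        tokens = self.patch_embed(rgb)                         # B x C x H/16 x W/16
        b, c, h, w = tokens.shape
        seq = self.transformer(tokens.flatten(2).transpose(1, 2))
        feat_t = seq.transpose(1, 2).reshape(b, c, h, w)       # global-context features
        feat_c = self.cnn(rgb)                                 # local-detail features
        feat_c = nn.functional.adaptive_avg_pool2d(feat_c, (h, w))
        fused = self.fusion(feat_t, feat_c)
        return self.coarse_head(feat_t), self.contour_head(feat_c), self.final_head(fused)


if __name__ == "__main__":
    coarse, contour, saliency = DualBranchSOD()(torch.randn(1, 3, 256, 256))
    print(coarse.shape, contour.shape, saliency.shape)
```

In the full model, the RGB-D variant would additionally fuse depth features into each branch (the Cross-modality Multi-Scale Transformer Module and Depth-induced Enhancement Module named above); that part is omitted from this sketch.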
