Abstract

RGB-D salient object detection (SOD) can usually be divided into three stages: feature extraction, feature fusion, and feature prediction. Most approaches treat the features extracted by the backbone network identically in the last two stages, neglecting the fact that different modalities and different hierarchical features play distinct roles in SOD, which degrades detection performance. To address this problem, we propose a transformer-based difference fusion network (TDF-Net) for RGB-D SOD that treats modal features and hierarchical features differently in the feature fusion and feature prediction stages, respectively. First, we adopt the pyramid vision transformer as a feature extractor to obtain hierarchical features from the input RGB and depth images. Second, we propose a differential interactive fusion module, in which the RGB and depth modalities learn modality-specific features independently while guiding each other during fusion. Finally, we divide the cross-modally fused hierarchical features into high-level and low-level features and propose three types of cross-layer fusion modules that integrate features from different layers in a discriminative way to predict the saliency maps. Extensive experiments on five benchmark datasets confirm that the proposed TDF-Net outperforms state-of-the-art methods.
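The three-stage pipeline described above can be illustrated with a minimal structural sketch. The module and class names below (`DifferentialInteractiveFusion`, `CrossLayerFusion`), the gating-based fusion, and the channel/stage sizes are illustrative assumptions, not the authors' reference implementation; the backbone features are stubbed with random tensors shaped like typical pyramid vision transformer stages.

```python
# Hypothetical sketch of the TDF-Net pipeline structure (not the authors' code).
import torch
import torch.nn as nn

class DifferentialInteractiveFusion(nn.Module):
    """Stand-in for the differential interactive fusion module: each modality
    keeps its own branch, and the two modalities gate (guide) each other."""
    def __init__(self, channels):
        super().__init__()
        self.rgb_branch = nn.Conv2d(channels, channels, 3, padding=1)
        self.depth_branch = nn.Conv2d(channels, channels, 3, padding=1)
        self.gate_rgb = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.gate_depth = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, f_rgb, f_depth):
        r = self.rgb_branch(f_rgb)            # modality-specific RGB feature
        d = self.depth_branch(f_depth)        # modality-specific depth feature
        r_guided = r * self.gate_depth(d)     # depth guides RGB
        d_guided = d * self.gate_rgb(r)       # RGB guides depth
        return r_guided + d_guided            # fused cross-modal feature

class CrossLayerFusion(nn.Module):
    """Toy cross-layer fusion: upsample the deeper feature and merge it
    with the shallower one (one plausible variant of such a module)."""
    def __init__(self, high_ch, low_ch, out_ch):
        super().__init__()
        self.reduce = nn.Conv2d(high_ch + low_ch, out_ch, 3, padding=1)

    def forward(self, f_low, f_high):
        f_high = nn.functional.interpolate(
            f_high, size=f_low.shape[2:], mode="bilinear", align_corners=False)
        return self.reduce(torch.cat([f_low, f_high], dim=1))

# Dummy hierarchical features standing in for the two PVT backbones (4 stages).
stages = [(64, 56), (128, 28), (320, 14), (512, 7)]
rgb_feats = [torch.randn(1, c, s, s) for c, s in stages]
depth_feats = [torch.randn(1, c, s, s) for c, s in stages]

# Stage 2: cross-modal fusion at every hierarchy level.
fused = [DifferentialInteractiveFusion(c)(fr, fd)
         for (c, _), fr, fd in zip(stages, rgb_feats, depth_feats)]

# Stage 3: split into high-level / low-level groups, then fuse across layers.
high = CrossLayerFusion(512, 320, 128)(fused[2], fused[3])   # deep stages
low = CrossLayerFusion(128, 64, 64)(fused[0], fused[1])      # shallow stages
saliency_logits = CrossLayerFusion(128, 64, 1)(low, high)    # coarse saliency map
print(saliency_logits.shape)  # torch.Size([1, 1, 56, 56])
```

In this sketch the gating design simply makes each modality's contribution depend on the other, which is one way to read "the two modalities guide each other to fuse features"; the paper's actual module may differ.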
