Abstract

RGB-D saliency detection has attracted increasing attention, due to its effectiveness and the fact that depth cues can now be conveniently captured. Existing works often focus on learning a shared representation through various fusion strategies, while few methods explicitly consider how to preserve modality-specific characteristics. In this paper, taking a new perspective, we propose a specificity-preserving network (SP-Net) for RGB-D saliency detection, which improves saliency detection performance by exploring both the shared information and modality-specific properties (i.e., specificity). Specifically, two modality-specific networks and a shared learning network are adopted to generate individual and shared saliency maps. A cross-enhanced integration module (CIM) is proposed to fuse cross-modal features in the shared learning network, which are then propagated to the next layer to integrate cross-level information. In addition, we propose a multi-modal feature aggregation (MFA) module to integrate the modality-specific features from each individual decoder into the shared decoder, which provides rich complementary multi-modal information to boost saliency detection performance. Furthermore, skip connections are used to combine hierarchical features between the encoder and decoder layers. Experiments on six benchmark datasets demonstrate that our SP-Net outperforms other state-of-the-art methods. Code is available at: https://github.com/taozh2017/SPNet.
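To make the described architecture concrete, below is a minimal PyTorch-style sketch of the dual-branch layout: two modality-specific encoders, per-level cross-enhanced integration (CIM) for the shared stream, and cross-level propagation of the fused features. This is an illustrative approximation, not the released implementation (see the linked repository); the gating operations inside `CIM`, the `TinyEncoder` stand-in backbone, and the channel sizes are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CIM(nn.Module):
    """Illustrative cross-enhanced integration module (not the released
    SP-Net code): each modality is enhanced by a gate computed from the
    other, the two streams are fused, and the result is combined with
    the fused features propagated from the previous level."""

    def __init__(self, ch):
        super().__init__()
        self.rgb_gate = nn.Conv2d(ch, ch, 3, padding=1)
        self.dep_gate = nn.Conv2d(ch, ch, 3, padding=1)
        self.fuse = nn.Conv2d(2 * ch, ch, 1)

    def forward(self, f_rgb, f_dep, f_prev=None):
        # Cross-enhancement: each stream is modulated by the other modality.
        rgb_e = f_rgb + f_rgb * torch.sigmoid(self.dep_gate(f_dep))
        dep_e = f_dep + f_dep * torch.sigmoid(self.rgb_gate(f_rgb))
        fused = self.fuse(torch.cat([rgb_e, dep_e], dim=1))
        if f_prev is not None:
            # Cross-level integration: resize the previous level's fused
            # features to the current resolution and add them.
            f_prev = F.interpolate(f_prev, size=fused.shape[-2:],
                                   mode="bilinear", align_corners=False)
            fused = fused + f_prev
        return fused


class TinyEncoder(nn.Module):
    """Stand-in backbone returning three feature levels; the released
    model uses a pretrained backbone instead."""

    def __init__(self, ch=64):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(3 if i == 0 else ch, ch, 3, stride=2, padding=1),
                nn.ReLU(inplace=True))
            for i in range(3)
        ])

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats  # low- to high-level features


class SPNetSketch(nn.Module):
    """Two modality-specific encoders feed per-level CIMs, whose outputs
    form the shared representation; the three decoders (RGB, depth,
    shared) are omitted here (see the MFA sketch further below)."""

    def __init__(self, ch=64, levels=3):
        super().__init__()
        self.rgb_enc = TinyEncoder(ch)
        self.dep_enc = TinyEncoder(ch)
        self.cims = nn.ModuleList([CIM(ch) for _ in range(levels)])

    def forward(self, rgb, depth):
        rgb_feats = self.rgb_enc(rgb)
        dep_feats = self.dep_enc(depth)
        shared, f_prev = [], None
        for cim, fr, fd in zip(self.cims, rgb_feats, dep_feats):
            f_prev = cim(fr, fd, f_prev)
            shared.append(f_prev)
        return rgb_feats, dep_feats, shared


if __name__ == "__main__":
    rgb = torch.randn(1, 3, 224, 224)
    depth = torch.randn(1, 3, 224, 224)  # depth replicated to 3 channels (an assumption)
    _, _, shared = SPNetSketch()(rgb, depth)
    print([f.shape for f in shared])
```

In the full model, a pretrained backbone would replace `TinyEncoder`, and the modality-specific and shared decoders would turn these features into the individual and shared saliency maps described in the abstract.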

Highlights

  • Salient object detection (SOD) aims to model the mechanism of human visual attention and locate the most visually distinctive object(s) in a given scene [62]

  • To make full use of the features learned in the modality-specific decoders, we propose a simple but effective multi-modal feature aggregation (MFA) module to integrate them into the shared decoder (a minimal illustrative sketch follows these highlights)

  • It is worth noting that our model obtains better performance on STERE and NLPR than most compared RGB-D saliency detection methods
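As referenced in the highlights above, the sketch below illustrates one plausible form of multi-modal feature aggregation: features from the two modality-specific decoders re-weight and enrich the corresponding shared-decoder features. The channel-attention gating shown here (the `MFA` module with `rgb_att` and `dep_att`) is an assumption for illustration, not the released SP-Net code.

```python
import torch
import torch.nn as nn


class MFA(nn.Module):
    """Illustrative multi-modal feature aggregation: shared-decoder
    features are re-weighted by channel attention derived from the
    RGB-specific and depth-specific decoder features, then aggregated
    residually. The exact operations in SP-Net may differ."""

    def __init__(self, channels):
        super().__init__()
        self.rgb_att = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                     nn.Conv2d(channels, channels, 1),
                                     nn.Sigmoid())
        self.dep_att = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                     nn.Conv2d(channels, channels, 1),
                                     nn.Sigmoid())
        self.out = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, f_shared, f_rgb_dec, f_dep_dec):
        # Channel-wise re-weighting of the shared features by each
        # modality-specific decoder, followed by a residual connection.
        f = f_shared * self.rgb_att(f_rgb_dec) + f_shared * self.dep_att(f_dep_dec)
        return self.out(f + f_shared)


# Toy usage: 64-channel decoder features at 56x56 resolution.
f_shared = torch.randn(1, 64, 56, 56)
f_rgb_dec = torch.randn(1, 64, 56, 56)
f_dep_dec = torch.randn(1, 64, 56, 56)
print(MFA(64)(f_shared, f_rgb_dec, f_dep_dec).shape)  # torch.Size([1, 64, 56, 56])
```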


Summary

Introduction

Salient object detection (SOD) aims to model the mechanism of human visual attention and locate the most visually distinctive object(s) in a given scene [62]. Fusing RGB and depth images has gained increasing interest in the SOD community [6, 22, 25, 43, 53, 86, 87, 99, 101], yet adaptively fusing the two modalities (i.e., RGB and depth) remains challenging. The early fusion strategy often adopts a simple concatenation to integrate the two modalities: these methods [54, 62, 69, 73] directly concatenate RGB and depth images to form a four-channel input. This type of fusion does not account for the distribution gap between the two modalities, which can result in inaccurate feature fusion, and it still struggles to capture the complex interactions between the two modalities.
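For reference, the early-fusion strategy discussed above amounts to a channel-wise concatenation before any network processing; a minimal sketch, assuming a single-channel depth map and PyTorch tensors:

```python
import torch

# Early fusion: stack RGB (3 channels) and depth (1 channel) into a
# four-channel input, leaving the network alone to bridge the
# distribution gap between the two modalities.
rgb = torch.randn(1, 3, 224, 224)
depth = torch.randn(1, 1, 224, 224)
early_fused = torch.cat([rgb, depth], dim=1)
print(early_fused.shape)  # torch.Size([1, 4, 224, 224])
```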

RGB Salient Object Detection
RGB-D Salient Object Detection
Multi-modal Learning
Methodology
Overview
Modality-specific Learning Network
Shared Learning Network
Cross-enhanced Integration Module
Multi-modal Feature Aggregation
Loss Function
Experimental Results and Analysis
Datasets
Evaluation Metrics
Implementation Details
Compared RGB-D SOD Models
Quantitative Evaluation
Qualitative Evaluation
Inference Time and Model Size
Ablation Studies
Effectiveness of CIM
Effectiveness of MFA
Effectiveness of Modality-specific Decoder
Effects on Different Numbers of CIM
Attribute-based Evaluation
Failure Cases and Discussion
Application for RGB-D Camouflaged Object Detection
Results
Conclusion