Abstract
RGB-D saliency detection has attracted increasing attention, due to its effectiveness and the fact that depth cues can now be conveniently captured. Existing works often focus on learning a shared representation through various fusion strategies, with few methods explicitly considering how to preserve modality-specific characteristics. In this paper, taking a new perspective, we propose a specificity-preserving network (SP-Net) for RGB-D saliency detection, which benefits saliency detection performance by exploring both the shared information and modality-specific properties (e.g., specificity). Specifically, two modality-specific networks and a shared learning network are adopted to generate individual and shared saliency maps. A cross-enhanced integration module (CIM) is proposed to fuse cross-modal features in the shared learning network, which are then propagated to the next layer for integrating cross-level information. Besides, we propose a multi-modal feature aggregation (MFA) module to integrate the modality-specific features from each individual decoder into the shared decoder, which can provide rich complementary multi-modal information to boost the saliency detection performance. Further, a skip connection is used to combine hierarchical features between the encoder and decoder layers. Experiments on six benchmark datasets demonstrate that our SP-Net outperforms other state-of-the-art methods. Code is available at: https://github.com/taozh2017/SPNet.
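For concreteness, the following is a minimal PyTorch sketch of a cross-enhanced integration block in the spirit of the CIM described above. The module name, channel handling, and gating choices are illustrative assumptions for exposition, not the released implementation at the repository above; each modality's feature is enhanced by an attention gate computed from the other before the two are fused.

```python
import torch
import torch.nn as nn

class CrossEnhancedIntegration(nn.Module):
    """Illustrative cross-modal fusion block (a sketch, not the authors' exact CIM)."""
    def __init__(self, channels):
        super().__init__()
        # 1x1 conv + sigmoid produces a spatial-channel attention gate per modality.
        self.rgb_gate = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.depth_gate = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.fuse = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, f_rgb, f_depth):
        # Enhance each modality with a gate derived from the other, keep a residual path.
        rgb_enh = f_rgb * self.depth_gate(f_depth) + f_rgb
        depth_enh = f_depth * self.rgb_gate(f_rgb) + f_depth
        # Concatenate the enhanced features and fuse them into a shared representation,
        # which would then be propagated to the next shared-decoder layer.
        return self.fuse(torch.cat([rgb_enh, depth_enh], dim=1))
```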
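A quick usage check of this sketch: instantiating `CrossEnhancedIntegration(64)` and passing two `(1, 64, 56, 56)` feature maps returns a fused `(1, 64, 56, 56)` tensor, mirroring how the fused cross-modal feature keeps the same resolution as its inputs before being passed deeper into the shared decoder.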
Highlights
Salient object detection (SOD) aims to model the mechanism of human visual attention and locate the most visually distinctive object(s) in a given scene [62].
To make full use of the features learned in the modality-specific decoders, we propose a simple but effective multi-modal feature aggregation (MFA) module to integrate them into the shared decoder (see the sketch after these highlights).
It is worth noting that our model obtains better performance on the STERE and NLPR datasets than most of the compared RGB-D saliency detection methods.
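The sketch below illustrates, under stated assumptions, how modality-specific decoder features could be aggregated into the shared decoder in the spirit of the MFA module. The class name, the concatenate-reduce-refine structure, and the residual connection are hypothetical choices for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class MultiModalFeatureAggregation(nn.Module):
    """Illustrative aggregation of modality-specific decoder features
    into the shared decoder (a sketch, not the paper's exact MFA)."""
    def __init__(self, channels):
        super().__init__()
        # Reduce the concatenated (shared + RGB + depth) features back to `channels`.
        self.reduce = nn.Conv2d(3 * channels, channels, 1)
        self.refine = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, f_shared, f_rgb_dec, f_depth_dec):
        # Concatenate the shared-decoder feature with both modality-specific
        # decoder features, then reduce, refine, and add back the shared feature.
        x = torch.cat([f_shared, f_rgb_dec, f_depth_dec], dim=1)
        return self.refine(self.reduce(x)) + f_shared
```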
Summary
Salient object detection (SOD) aims to model the mechanism of human visual attention and locate the most visually distinctive object(s) in a given scene [62]. Fusing RGB and depth images has gained increasing interest in the SOD community [6, 22, 25, 43, 53, 86, 87, 99, 101], yet adaptively fusing the two modalities (i.e., RGB and depth) remains a challenging task. Early fusion strategies often adopt a simple concatenation to integrate the two modalities: these methods [54, 62, 69, 73] directly stack RGB and depth images to form a four-channel input. This type of fusion does not account for the distribution gap between the two modalities, which can result in inaccurate feature fusion. Capturing the complex interactions between the two modalities therefore remains challenging.
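To make the early-fusion baseline concrete, here is a minimal PyTorch sketch of the four-channel input described above. The tensor sizes and the first-layer configuration are illustrative assumptions; the point is only that RGB (3 channels) and depth (1 channel) are concatenated along the channel dimension before entering a single network whose first convolution accepts 4 input channels.

```python
import torch
import torch.nn as nn

# Early fusion: stack RGB (3 channels) and depth (1 channel) into a
# four-channel input and feed it to one shared network.
rgb = torch.rand(1, 3, 224, 224)    # toy RGB image
depth = torch.rand(1, 1, 224, 224)  # toy single-channel depth map

early_fused = torch.cat([rgb, depth], dim=1)   # shape: (1, 4, 224, 224)

# The first convolution must be adapted to 4 input channels.
first_conv = nn.Conv2d(4, 64, kernel_size=7, stride=2, padding=3)
features = first_conv(early_fused)
print(features.shape)  # torch.Size([1, 64, 112, 112])
```

Because the two modalities are merged at the raw-input level, no mechanism exists here to handle their distribution gap, which is precisely the limitation the shared-plus-specific design of SP-Net is meant to address.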