Abstract

Existing state-of-the-art RGB-D saliency detection models mainly utilize depth information as complementary cues to enhance the RGB information. However, depth maps are easily affected by the acquisition environment and often contain considerable noise, so indiscriminately integrating multi-modality (i.e., RGB and depth) features may yield noise-degraded saliency maps. In this paper, we propose a novel Adaptive Fusion Network (AFNet) to address this problem. Specifically, we design a triplet encoder consisting of three subnetworks that process RGB, depth, and fused features, respectively. The three subnetworks are interlinked and form a grid net to facilitate mutual refinement of these multi-modality features. Moreover, we propose a Multi-modality Feature Interaction (MFI) module to exploit complementary cues between the depth and RGB modalities and to fuse the multi-modality features adaptively. Finally, we design a Cascaded Feature Interweaved Decoder (CFID) that exploits complementary information among multi-level features and refines them iteratively to achieve accurate saliency detection. Experimental results on six commonly used benchmark datasets show that the proposed AFNet outperforms 20 state-of-the-art counterparts in terms of six widely adopted evaluation metrics. The source code will be publicly available at https://github.com/clelouch/AFNet upon paper acceptance.
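
To make the idea of adaptive multi-modality fusion concrete, the following is a minimal PyTorch-style sketch of one plausible gated fusion step. It is not the authors' MFI module; the module and parameter names (AdaptiveFusion, gate) are hypothetical, and the sketch only illustrates the general principle stated in the abstract: weighting depth features by a learned confidence so that noisy depth regions are not integrated indiscriminately.

```python
import torch
import torch.nn as nn


class AdaptiveFusion(nn.Module):
    """Illustrative gated RGB-depth fusion (assumption, not the paper's MFI)."""

    def __init__(self, channels: int):
        super().__init__()
        # Predict a per-pixel confidence map for the depth features
        # from the concatenated RGB and depth features.
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, rgb_feat: torch.Tensor, depth_feat: torch.Tensor) -> torch.Tensor:
        # w in [0, 1]: high where depth is judged reliable, low where it is noisy.
        w = self.gate(torch.cat([rgb_feat, depth_feat], dim=1))
        # The weighted combination suppresses noisy depth cues instead of
        # integrating them indiscriminately.
        return rgb_feat + w * depth_feat


if __name__ == "__main__":
    fuse = AdaptiveFusion(channels=64)
    rgb = torch.randn(1, 64, 56, 56)
    depth = torch.randn(1, 64, 56, 56)
    print(fuse(rgb, depth).shape)  # torch.Size([1, 64, 56, 56])
```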

