Abstract

RGB-D co-salient object detection aims to segment co-occurring salient objects when given a group of relevant images and depth maps. Previous methods often adopt separate pipeline and use hand-crafted features, being hard to capture the patterns of co-occurring salient objects and leading to unsatisfactory results. Using end-to-end CNN models is a straightforward idea, but they are less effective in exploiting global cues due to the intrinsic limitation. Thus, in this paper, we alternatively propose an end-to-end transformer-based model which uses class tokens to explicitly capture implicit class knowledge to perform RGB-D co-salient object detection, denoted as CTNet. Specifically, we first design adaptive class tokens for individual images to explore intra-saliency cues and then develop common class tokens for the whole group to explore inter-saliency cues. Besides, we also leverage the complementary cues between RGB images and depth maps to promote the learning of the above two types of class tokens. In addition, to promote model evaluation, we construct a challenging and large-scale benchmark dataset, named RGBD CoSal1k, which collects 106 groups containing 1000 pairs of RGB-D images with complex scenarios and diverse appearances. Experimental results on three benchmark datasets demonstrate the effectiveness of our proposed method.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.