Abstract
RGB-D salient object detection (SOD) uses depth information to handle challenging scenes and obtain high-quality saliency maps. Existing state-of-the-art RGB-D saliency detection methods overwhelmingly rely on directly fusing depth information. Although these methods improve the accuracy of saliency prediction through various cross-modality fusion strategies, misinformation from poor-quality depth images can degrade the saliency prediction. To address this issue, this paper proposes a novel RGB-D salient object detection model (SiaTrans) that is trained on depth image quality classification at the same time as on SOD. Because RGB and depth images share common information about salient objects, SiaTrans uses a Siamese transformer network with shared weight parameters as the encoder and extracts RGB and depth features concatenated along the batch dimension, saving space resources without compromising performance. SiaTrans uses the class token in the backbone network (T2T-ViT) to classify the quality of depth images without preventing the token sequence from continuing on to the saliency detection task. The greatest benefit of our cross-modality fusion (CMF) module and decoder is that they keep RGB and RGB-D information decoding consistent. At test time, SiaTrans decides whether to perform an RGB-D or an RGB saliency detection task according to the quality classification signal of the depth image. Comprehensive experiments on nine RGB-D SOD benchmark datasets show that SiaTrans has the best overall performance and the least computation compared with recent state-of-the-art methods.

• The proposed Siamese transformer structure (SiaTrans) can save space resources and make full use of GPU computing power.
• SiaTrans can perform the RGB SOD task and the RGB-D SOD task, and can classify depth images according to their quality.
• SiaTrans uses one network to achieve three visual tasks while the amount of computation and the number of parameters remain low.
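
The two mechanisms described in the abstract, shared-weight encoding of RGB and depth inputs concatenated along the batch dimension and test-time routing on the depth quality signal, can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: `shared_encoder` is a toy stand-in for the T2T-ViT backbone, and the names `siamese_encode` and `choose_branch`, the scalar quality score, and the 0.5 threshold are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Sample = List[float]


def shared_encoder(batch: Sequence[Sample]) -> List[Sample]:
    """Toy stand-in for the shared-weight backbone (assumption, not T2T-ViT).

    A single weight is applied to every sample, so RGB and depth inputs
    are processed by exactly the same parameters.
    """
    weight = 0.5  # hypothetical shared parameter
    return [[weight * value for value in sample] for sample in batch]


def siamese_encode(
    rgb_batch: Sequence[Sample], depth_batch: Sequence[Sample]
) -> Tuple[List[Sample], List[Sample]]:
    """Encode both modalities in one pass by concatenating on the batch axis."""
    n = len(rgb_batch)
    joint_batch = list(rgb_batch) + list(depth_batch)  # batch-dimension concat
    features = shared_encoder(joint_batch)              # one forward pass
    return features[:n], features[n:]                   # split back per modality


def choose_branch(depth_quality_score: float, threshold: float = 0.5) -> str:
    """Pick the task at test time from the depth quality signal.

    The score range and the threshold value are assumptions; the abstract
    only states that the class token yields a quality classification.
    """
    return "rgbd" if depth_quality_score >= threshold else "rgb"


rgb_features, depth_features = siamese_encode([[1.0, 2.0]], [[4.0, 6.0]])
branch_good_depth = choose_branch(0.9)
branch_poor_depth = choose_branch(0.1)
```

In this sketch a poor-quality depth image routes the sample to the RGB-only branch, so unreliable depth features never reach the fusion step.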