Abstract

Incorporating depth (D) information into RGB images has proven effective and robust for semantic segmentation. However, fusing the two modalities is not trivial due to the discrepancy in their physical meaning: RGB encodes photometric appearance, while D encodes geometric depth. In this paper, we propose a co-attention network (CANet) to build sound interaction between RGB and depth features. The key component of CANet is the co-attention fusion part, which consists of three modules. Specifically, the position and channel co-attention fusion modules adaptively fuse RGB and depth features in the spatial and channel dimensions, respectively. A final fusion co-attention module further integrates the outputs of the position and channel modules to obtain a more representative feature, which is used for the final semantic segmentation. Extensive experiments demonstrate the effectiveness of CANet in fusing RGB and depth features, achieving state-of-the-art performance on two challenging RGB-D semantic segmentation datasets, i.e., NYUDv2 and SUN-RGBD.
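To make the idea of cross-modal co-attention fusion concrete, the sketch below shows one plausible form of the position (spatial) co-attention fusion module in PyTorch. The abstract does not specify CANet's exact formulation, so this is a minimal illustration assuming a DANet-style attention layout in which RGB features provide queries and depth features provide keys and values; all class names, shapes, and the residual fusion choice are illustrative assumptions, not the paper's definitive design.

```python
# Hypothetical sketch of a cross-modal position co-attention fusion module.
# Assumes a DANet-style formulation; not the paper's exact CANet module.
import torch
import torch.nn as nn


class PositionCoAttentionFusion(nn.Module):
    """Fuses RGB and depth feature maps with spatial cross-attention.

    Queries come from the RGB stream and keys/values from the depth stream,
    so each RGB position attends over all depth positions (an assumption
    about the module's structure, inferred from the abstract's description).
    """

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        inner = channels // reduction
        self.query = nn.Conv2d(channels, inner, kernel_size=1)   # from RGB
        self.key = nn.Conv2d(channels, inner, kernel_size=1)     # from depth
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learned fusion weight

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        b, c, h, w = rgb.shape
        q = self.query(rgb).flatten(2).transpose(1, 2)   # (B, HW, C')
        k = self.key(depth).flatten(2)                   # (B, C', HW)
        v = self.value(depth).flatten(2)                 # (B, C, HW)
        attn = torch.softmax(q @ k, dim=-1)              # (B, HW, HW)
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        return rgb + self.gamma * out  # residual fusion into the RGB stream


# Toy usage: fuse 64-channel RGB and depth features at 32x32 resolution.
if __name__ == "__main__":
    fuse = PositionCoAttentionFusion(channels=64)
    rgb_feat = torch.randn(2, 64, 32, 32)
    depth_feat = torch.randn(2, 64, 32, 32)
    print(fuse(rgb_feat, depth_feat).shape)  # torch.Size([2, 64, 32, 32])
```

The channel co-attention fusion module would follow the same pattern with attention computed over the channel axis instead of spatial positions, and the fusion co-attention module would combine the two outputs.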
