This paper addresses RGB-D semantic segmentation of indoor scenes. We introduce a novel gravity direction detection method, based on vertical line fitting that combines 2D visual information with 3D geometric information, to improve the original HHA depth encoding. To fuse the two streams of deep convolutional networks operating on RGB and on the depth encoding, we propose a joint modelling method that learns a weighted summing layer to combine their prediction results. Finally, to refine the pixel-wise score maps, we adopt a fully connected CRF as a post-processing step and propose a pairwise potential function that incorporates a normal kernel to exploit geometric information. Experimental results show that the proposed approach achieves state-of-the-art RGB-D semantic segmentation performance on a public dataset.
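
To make the weighted summing fusion concrete, the following is a minimal sketch, assuming one scalar weight per class and per stream. The names `fuse_scores` and `argmax_labels`, the weight layout, and the toy values are hypothetical and are not taken from the paper. In the paper the weights are learned, whereas here they are fixed by hand purely for illustration.

```python
from typing import List

# A score map is indexed as [class][row][col].
ScoreMap = List[List[List[float]]]


def fuse_scores(rgb: ScoreMap, depth: ScoreMap,
                w_rgb: List[float], w_depth: List[float]) -> ScoreMap:
    """Fuse two per-class score maps with a per-class weighted sum.

    Assumption: one scalar weight per class for each stream. The actual
    parameterisation of the learned layer may differ.
    """
    if not (len(rgb) == len(depth) == len(w_rgb) == len(w_depth)):
        raise ValueError("class counts of score maps and weights must match")
    fused: ScoreMap = []
    for c in range(len(rgb)):
        fused.append([
            [w_rgb[c] * r + w_depth[c] * d for r, d in zip(rgb_row, depth_row)]
            for rgb_row, depth_row in zip(rgb[c], depth[c])
        ])
    return fused


def argmax_labels(scores: ScoreMap) -> List[List[int]]:
    """Pick the highest-scoring class at each pixel."""
    n_classes = len(scores)
    rows, cols = len(scores[0]), len(scores[0][0])
    return [
        [max(range(n_classes), key=lambda c: scores[c][i][j]) for j in range(cols)]
        for i in range(rows)
    ]


# Toy example: 2 classes, a 1x2 image, equal hand-set weights.
rgb_scores = [[[1.0, 0.0]], [[0.0, 1.0]]]
depth_scores = [[[0.0, 0.0]], [[2.0, 0.0]]]
fused = fuse_scores(rgb_scores, depth_scores, [0.5, 0.5], [0.5, 0.5])
labels = argmax_labels(fused)
```

In this toy case the depth stream overrides the RGB stream at the first pixel, which is the kind of complementary behaviour a fusion layer is meant to capture. The fused scores would then be passed to the CRF refinement step.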