Convolutional neural networks (CNNs) have been widely applied to parsing indoor scenes and segmenting objects in color images. Nonetheless, most RGB-based methods suffer from a lack of geometric and contextual information, and in RGB-D semantic segmentation they treat depth features only as an auxiliary module. In this study, a novel depth-dominant network (DDNet) is proposed to fully exploit the rich contextual information in the depth map. The key insight is that the explicit geometric information in the depth image is more conducive to segmentation than RGB data. Unlike previous methods, DDNet is a depth-based network with two CNN branches that extract color and depth features, respectively. The depth branch, as the core of the encoder network, is assigned a larger fusion weight to extract geometric information, while the color branch supplies semantic information and complementary geometric cues to the depth feature maps. The effectiveness of our proposed depth-based architecture is demonstrated by comprehensive experimental evaluations and ablation studies on challenging RGB-D semantic segmentation benchmarks, including NYUv2 and a subset of ScanNetv2.
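The abstract does not give implementation details of the fusion module, so the following is only a minimal PyTorch sketch of the depth-dominant idea it describes: a two-branch encoder in which the depth features receive the larger fusion weight. All names here (DepthDominantFusion, TwoBranchEncoder, the weight alpha, and the toy convolutional backbones) are hypothetical illustrations, not the authors' code.

```python
# Minimal sketch of depth-dominant feature fusion, assuming a simple
# elementwise weighted sum; DDNet's actual fusion module may differ.
import torch
import torch.nn as nn


class DepthDominantFusion(nn.Module):
    """Fuses depth and RGB feature maps with a larger weight on depth."""

    def __init__(self, channels: int, alpha: float = 0.7):
        super().__init__()
        # alpha > 0.5 makes the depth branch dominant (hypothetical value;
        # the paper only states that depth gets a larger fusion weight).
        self.alpha = alpha
        # 1x1 conv to mix the fused features (illustrative choice).
        self.mix = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, depth_feat: torch.Tensor, rgb_feat: torch.Tensor) -> torch.Tensor:
        fused = self.alpha * depth_feat + (1.0 - self.alpha) * rgb_feat
        return self.mix(fused)


class TwoBranchEncoder(nn.Module):
    """Toy two-branch encoder: one CNN per modality, fused depth-dominantly."""

    def __init__(self, channels: int = 64):
        super().__init__()
        self.rgb_branch = nn.Sequential(
            nn.Conv2d(3, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.depth_branch = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.fusion = DepthDominantFusion(channels)

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        # Depth features are passed first and weighted more heavily.
        return self.fusion(self.depth_branch(depth), self.rgb_branch(rgb))


if __name__ == "__main__":
    enc = TwoBranchEncoder()
    rgb = torch.randn(1, 3, 480, 640)    # NYUv2-sized RGB frame
    depth = torch.randn(1, 1, 480, 640)  # corresponding depth map
    print(enc(rgb, depth).shape)         # torch.Size([1, 64, 480, 640])
```

A learned or spatially varying weight could replace the fixed scalar alpha; the sketch uses a constant only to keep the depth-dominant weighting explicit.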