Abstract

Indoor semantic segmentation is a long-standing vision task that has recently been advanced by convolutional neural networks (CNNs), but it remains challenging due to the heavy occlusion and large scale variation of indoor scenes. Existing CNN-based methods mainly focus on using auxiliary depth data to enrich the features extracted from RGB images, and pay less attention to exploiting multi-scale information in the extracted features, which is essential for distinguishing objects in highly cluttered indoor scenes. This paper proposes a deep cross-scale feature propagation network (CSNet) to effectively learn and fuse multi-scale features for robust semantic segmentation of indoor scene images. The proposed CSNet adopts an encoder-decoder architecture. During encoding, CSNet propagates contextual information across scales and learns discriminative multi-scale features that are robust to large object scale variation and indoor occlusion. The decoder then adaptively integrates the multi-scale encoded features, with fusion supervision at all scales, to generate the final semantic segmentation prediction. Extensive experiments on two challenging benchmarks demonstrate that CSNet can effectively learn multi-scale representations for robust indoor semantic segmentation, achieving outstanding performance with mIoU scores of 51.5 and 50.8 on the NYUDv2 and SUN-RGBD datasets, respectively.
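To make the encoder-decoder idea concrete, the following is a minimal PyTorch-style sketch of cross-scale feature propagation with per-scale supervision heads. The abstract does not give implementation details, so the module names (CrossScaleFusion, CSNetSketch), channel widths, and backbone layout are illustrative assumptions, not the authors' published CSNet.

```python
# Illustrative sketch only: module names, channel sizes, and layout are assumed,
# not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossScaleFusion(nn.Module):
    """Fuses a coarse feature map into a finer one (hypothetical propagation step)."""
    def __init__(self, fine_ch, coarse_ch):
        super().__init__()
        self.project = nn.Conv2d(coarse_ch, fine_ch, kernel_size=1)
        self.refine = nn.Sequential(
            nn.Conv2d(fine_ch, fine_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(fine_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, fine, coarse):
        # Upsample coarse features to the fine resolution, project channels, add, refine.
        coarse = F.interpolate(coarse, size=fine.shape[-2:], mode="bilinear",
                               align_corners=False)
        return self.refine(fine + self.project(coarse))


class CSNetSketch(nn.Module):
    """Toy encoder-decoder with one prediction head per scale, not the published model."""
    def __init__(self, num_classes=40):
        super().__init__()
        chs = [64, 128, 256]
        self.enc = nn.ModuleList()
        in_ch = 3
        for ch in chs:
            self.enc.append(nn.Sequential(
                nn.Conv2d(in_ch, ch, 3, stride=2, padding=1),
                nn.BatchNorm2d(ch), nn.ReLU(inplace=True)))
            in_ch = ch
        # Decoder fuses coarser scales back into finer ones.
        self.fuse = nn.ModuleList([CrossScaleFusion(chs[i], chs[i + 1])
                                   for i in range(len(chs) - 1)])
        # One head per scale so every scale can receive supervision during training.
        self.heads = nn.ModuleList([nn.Conv2d(ch, num_classes, 1) for ch in chs])

    def forward(self, x):
        feats = []
        for stage in self.enc:
            x = stage(x)
            feats.append(x)
        # Propagate context from coarse to fine scales.
        for i in reversed(range(len(self.fuse))):
            feats[i] = self.fuse[i](feats[i], feats[i + 1])
        # Per-scale logits; a training loop would sum segmentation losses over all of them.
        return [head(f) for head, f in zip(self.heads, feats)]


if __name__ == "__main__":
    model = CSNetSketch(num_classes=40)
    logits = model(torch.randn(1, 3, 480, 640))
    print([tuple(l.shape) for l in logits])
```

In this sketch, supervising every scale's logits against the downsampled ground truth mirrors the "fusion supervision at all scales" described in the abstract; the actual CSNet fusion mechanism may differ.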
