Abstract

Red-green-blue and depth (RGB-D) semantic segmentation is essential for indoor service robots to understand their surroundings accurately. Various RGB-D indoor semantic segmentation methods have been proposed since depth maps became widely available. These methods have focused mainly on integrating the multiscale and crossmodal features extracted from RGB images and depth maps in the encoder, and have used unified strategies to progressively recover local details in the decoder. However, by emphasizing crossmodal fusion in the encoder, these methods neglect the distinguishability between RGB and depth features during decoding, which undermines segmentation performance. To exploit the features adequately, we propose an efficient encoder-decoder architecture, the asymmetric multiscale and crossmodal fusion network (AMCFNet), endowed with a differential feature integration strategy. Unlike existing methods, we use simple crossmodal fusion in the encoder and design an elaborate decoder to improve semantic segmentation performance. Specifically, considering both high- and low-level features, we propose a semantic aggregation module (SAM) that processes the multiscale and crossmodal features from the last three network layers and aggregates high-level semantic information through a cascaded pyramid structure. Moreover, we design a spatial detail supplement module that draws low-level spatial details from depth maps and adaptively fuses these details with the information obtained from the SAM. Extensive experiments demonstrate that the proposed AMCFNet outperforms state-of-the-art approaches.
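To make the described asymmetric design more concrete, the following is a minimal PyTorch sketch of the overall idea: simple per-stage crossmodal fusion in the encoder, a cascaded top-down aggregation over the last three stages standing in for the SAM, and a gated fusion of low-level depth details standing in for the spatial detail supplement module. All module names, channel widths, and fusion operators here are illustrative assumptions, not the actual AMCFNet implementation.

```python
# Hypothetical sketch of the asymmetric encoder-decoder idea from the abstract.
# Channel sizes and fusion operations are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


def conv_bn_relu(in_ch, out_ch, k=3):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )


class SimpleEncoder(nn.Module):
    """Toy four-stage encoder; each stage halves the spatial resolution."""
    def __init__(self, in_ch, widths=(32, 64, 128, 256)):
        super().__init__()
        stages, prev = [], in_ch
        for w in widths:
            stages.append(nn.Sequential(conv_bn_relu(prev, w), nn.MaxPool2d(2)))
            prev = w
        self.stages = nn.ModuleList(stages)

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats  # features at 1/2, 1/4, 1/8, 1/16 resolution


class SemanticAggregation(nn.Module):
    """Cascaded top-down aggregation over the last three (high-level) stages."""
    def __init__(self, widths=(64, 128, 256), out_ch=64):
        super().__init__()
        self.reduce = nn.ModuleList([conv_bn_relu(w, out_ch, k=1) for w in widths])
        self.refine = nn.ModuleList([conv_bn_relu(out_ch, out_ch) for _ in widths[:-1]])

    def forward(self, feats):  # feats: stages 2..4, shallow -> deep
        reduced = [r(f) for r, f in zip(self.reduce, feats)]
        x = reduced[-1]
        for i in range(len(reduced) - 2, -1, -1):  # cascade from the deepest stage upward
            x = F.interpolate(x, size=reduced[i].shape[-2:], mode="bilinear",
                              align_corners=False)
            x = self.refine[i](x + reduced[i])
        return x


class SpatialDetailSupplement(nn.Module):
    """Adaptively fuse low-level depth details with the aggregated semantics."""
    def __init__(self, detail_ch, sem_ch):
        super().__init__()
        self.proj = conv_bn_relu(detail_ch, sem_ch, k=1)
        self.gate = nn.Sequential(nn.Conv2d(2 * sem_ch, sem_ch, 1), nn.Sigmoid())

    def forward(self, depth_low, semantics):
        semantics = F.interpolate(semantics, size=depth_low.shape[-2:],
                                  mode="bilinear", align_corners=False)
        detail = self.proj(depth_low)
        gate = self.gate(torch.cat([detail, semantics], dim=1))
        return semantics + gate * detail  # gated supplement of spatial detail


class ToyAMCFNet(nn.Module):
    def __init__(self, num_classes=40):
        super().__init__()
        self.rgb_enc = SimpleEncoder(3)
        self.depth_enc = SimpleEncoder(1)
        self.sam = SemanticAggregation()
        self.sds = SpatialDetailSupplement(detail_ch=32, sem_ch=64)
        self.head = nn.Conv2d(64, num_classes, 1)

    def forward(self, rgb, depth):
        rgb_feats = self.rgb_enc(rgb)
        depth_feats = self.depth_enc(depth)
        # Simple crossmodal fusion in the encoder: element-wise addition per stage.
        fused = [r + d for r, d in zip(rgb_feats, depth_feats)]
        semantics = self.sam(fused[1:])            # last three stages
        out = self.sds(depth_feats[0], semantics)  # low-level depth details
        out = self.head(out)
        return F.interpolate(out, scale_factor=2, mode="bilinear", align_corners=False)


if __name__ == "__main__":
    net = ToyAMCFNet()
    rgb, depth = torch.randn(1, 3, 64, 64), torch.randn(1, 1, 64, 64)
    print(net(rgb, depth).shape)  # torch.Size([1, 40, 64, 64])
```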
