Dense RGB-D Semantic Mapping with Pixel-Voxel Neural Network.

Cheng Zhao, Tom Duckett, Rustam Stolkin, Li Sun, Pulak Purkait

doi:10.3390/s18093099


Open Access

https://doi.org/10.3390/s18093099


Abstract

In this paper, a novel Pixel-Voxel network is proposed for dense 3D semantic mapping, which can perform dense 3D mapping while simultaneously recognizing and labelling the semantic category of each point in the 3D map. In our approach, we fully leverage the advantages of different modalities: the PixelNet learns high-level contextual information from 2D RGB images, and the VoxelNet learns 3D geometrical shapes from the 3D point cloud. Unlike the existing architecture that fuses score maps from different modalities with equal weights, we propose a softmax weighted fusion stack that adaptively learns the varying contributions of PixelNet and VoxelNet and fuses the score maps according to their respective confidence levels. Our approach achieved competitive results on both the SUN RGB-D and NYU V2 benchmarks, while the runtime of the proposed system reaches around 13 Hz, enabling near-real-time performance on an eight-core i7 PC with a single Titan X GPU.
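The idea of softmax weighted fusion can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so the following rests on assumptions: the function names, the per-class weight logits, and the single-point inputs are all hypothetical, and this is not the authors' implementation. Each class has one learnable logit per modality. A softmax over the two logits yields weights that sum to one, and these weights scale the PixelNet and VoxelNet scores before they are added.

```python
import math


def softmax(values):
    """Numerically stable softmax over a short list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(pixel_scores, voxel_scores, weight_logits):
    """Fuse per-class scores for a single 3D point.

    pixel_scores, voxel_scores: per-class scores from the two networks.
    weight_logits: one (pixel_logit, voxel_logit) pair per class.
        These stand in for learned parameters (an assumption of this sketch).
    """
    fused = []
    for p, v, logits in zip(pixel_scores, voxel_scores, weight_logits):
        # Softmax turns the two logits into weights that sum to 1.
        w_pixel, w_voxel = softmax(list(logits))
        fused.append(w_pixel * p + w_voxel * v)
    return fused


# Illustrative scores for three classes (made-up numbers).
pixel = [2.0, 0.5, -1.0]
voxel = [0.0, 1.5, 1.0]

# Zero logits reproduce equal-weight fusion (a plain average).
equal = fuse_scores(pixel, voxel, [(0.0, 0.0)] * 3)

# Non-zero logits favour PixelNet for class 0 and VoxelNet for class 1.
learned = fuse_scores(pixel, voxel, [(2.0, 0.0), (0.0, 2.0), (0.0, 0.0)])
```

With all logits at zero the sketch reduces to the equal-weight baseline, which makes the difference from adaptive weighting easy to see: only the logits change, not the fusion rule.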

Highlights

Real-time 3D semantic mapping is often desired in a number of robotics applications, such as localization [1,2], semantic navigation [3,4] and human-aware navigation [5]
Impressive results in semantic segmentation have been achieved with the advancement of convolutional neural networks (CNNs)
RGB [11,12,13], RGB-D [14,15,16,17] and point cloud [18,19] data have been successfully utilized for semantic segmentation

Summary

Introduction

Real-time 3D semantic mapping is often desired in a number of robotics applications, such as localization [1,2], semantic navigation [3,4] and human-aware navigation [5]. A variety of well-known methods such as RGB-D SLAM [8], Kinect Fusion [9] and ElasticFusion [10] can generate a dense or semi-dense 3D map from RGB-D videos. However, these 3D maps contain no semantic-level understanding of the observed scenes. RGB [11,12,13], RGB-D [14,15,16,17] and point cloud [18,19] data have been successfully utilized for semantic segmentation. Some of those methods are painfully slow due to their high computational demands, and they have not yet been integrated into real-time systems for robotics applications.

Methods

Findings

Discussion

Conclusion