Dense semantic scene understanding of the surrounding environment in the top view is a crucial task for autonomous vehicles. Recent LiDAR-based semantic perception works mainly focus on point-wise predictions over the LiDAR points rather than dense predictions of the environment, making them ill-suited for path-planning tasks. Pillar and voxel representations can yield dense predictions, but constructing and processing these representations is usually time-consuming. In this paper, we propose a top-view semantic completion network that produces accurate, dense grid-wise predictions in real time. Specifically, we propose an online distillation strategy consisting of two parts: a student model using 2D range-view and top-view representations, and a teacher model using range-view, top-view, and voxel representations. To transfer information between the different representations, we propose a cross-view association (CVA) module, which converts range-view features and 3D voxel features into top-view features. The proposed method avoids the difficulty of direct dense semantic segmentation in the top view by letting a point-wise sparse semantic segmentation module guide the dense grid-wise semantic completion. It also reduces computational complexity by confining the voxel representation and 3D convolutions to the teacher model. Experimental results on the SemanticKITTI dataset (46.4% mIoU) and the nuScenes-LidarSeg dataset (47.3% mIoU) demonstrate the effectiveness of the proposed sparse guidance and online distillation strategies.
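To make the cross-view idea concrete, the following is a minimal sketch of how per-point range-view features could be scattered into a top-view grid. It is an illustration under our own assumptions (PyTorch, a hypothetical `range_view_to_top_view` function, an assumed grid extent, and mean pooling per cell), not the paper's actual CVA implementation, which is described in the full text.

```python
import torch

def range_view_to_top_view(rv_feats, points, grid_size=(256, 256),
                           x_range=(-50.0, 50.0), y_range=(-50.0, 50.0)):
    """Hypothetical sketch: scatter per-point range-view features into a
    top-view (bird's-eye-view) grid, one plausible form of cross-view
    association.

    rv_feats: (N, C) per-point features sampled from the range-view branch.
    points:   (N, 3) LiDAR point coordinates (x, y, z) in meters.
    Returns a (C, H, W) top-view feature map, mean-pooled per grid cell.
    """
    H, W = grid_size
    x, y = points[:, 0], points[:, 1]
    # Keep only points inside the assumed top-view extent.
    mask = (x >= x_range[0]) & (x < x_range[1]) & \
           (y >= y_range[0]) & (y < y_range[1])
    rv_feats, x, y = rv_feats[mask], x[mask], y[mask]
    # Map metric coordinates to integer grid indices.
    ix = ((x - x_range[0]) / (x_range[1] - x_range[0]) * W).long().clamp(0, W - 1)
    iy = ((y - y_range[0]) / (y_range[1] - y_range[0]) * H).long().clamp(0, H - 1)
    flat_idx = iy * W + ix                       # (N,) flattened cell index
    C = rv_feats.shape[1]
    bev = rv_feats.new_zeros(H * W, C)
    count = rv_feats.new_zeros(H * W, 1)
    # Accumulate features and point counts per cell, then mean-pool.
    bev.index_add_(0, flat_idx, rv_feats)
    count.index_add_(0, flat_idx, torch.ones_like(rv_feats[:, :1]))
    bev = bev / count.clamp(min=1)
    return bev.t().reshape(C, H, W)
```

Cells that receive no points stay zero, which reflects why a completion step on top of the sparse projection is needed: the scatter alone yields a sparse top-view map, and the network must densify it.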