MsVFE and V-SIAM: Attention-based multi-scale feature interaction and fusion for outdoor LiDAR semantic segmentation

Jingru Yang, Jin Wang, Kaixiang Huang, Guodong Lu, Yu Sun, Huan Yu, Cheng Zhang, Ying Yang, Wenming Zou

doi:10.1016/j.neucom.2024.127576

Abstract

Semantic segmentation of outdoor LiDAR point clouds is a major research area in large-scale driving scenarios. However, the performance of state-of-the-art methods remains unsatisfactory because of two intrinsic limitations of outdoor point clouds: excessive sparsity and imbalanced point density, both of which make precise segmentation challenging. To address these problems and improve segmentation performance, we propose a new attention-based feature interaction module, the Voxel Slicing and Interaction based Attention Module (V-SIAM), which is integrated into segmentation networks. V-SIAM consists of a Voxel Slicing and Interaction Module (V-SIM) followed by a Voxel Attention Module (VAM). The V-SIM reduces the negative impact of imbalanced point density by enhancing the interaction among voxel features through a novel feature slicing scheme, which enriches the voxel feature details of the point clouds. The VAM reduces the negative effect of excessive sparsity by recalibrating voxel features, which yields adaptive and self-enhanced voxel features. In addition to V-SIAM, we propose a Multi-scale Voxel Feature Extractor (MsVFE), used as the preprocessing module of the segmentation network, which further alleviates the influence of excessive sparsity by fusing multi-scale voxel information from the sparse point clouds, so that more detailed features are extracted. Experimental results show that the proposed methods achieve 73.7% mIoU on the large-scale SemanticKITTI benchmark, outperforming the state-of-the-art PVKD and 2DPass by +1.3% mIoU and +0.8% mIoU, respectively. Moreover, the proposed MsVFE and V-SIAM achieve state-of-the-art performance on the Toronto3D and KITTI-360 datasets.
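The results above are reported in mIoU (mean intersection-over-union). As a reading aid, the following is a minimal sketch of how this metric is commonly computed from per-point labels. It is not the benchmark's official evaluation code: the function name `mean_iou`, the toy labels, and the three-class setup are hypothetical, and benchmark-specific details such as ignored labels are omitted.

```python
def mean_iou(y_true, y_pred, num_classes):
    """Mean intersection-over-union over classes present in labels or predictions."""
    # conf[t][p] counts points whose true label is t and predicted label is p.
    conf = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        conf[t][p] += 1

    ious = []
    for c in range(num_classes):
        tp = conf[c][c]                                       # true positives
        fp = sum(conf[r][c] for r in range(num_classes)) - tp  # false positives
        fn = sum(conf[c]) - tp                                # false negatives
        denom = tp + fp + fn
        if denom > 0:  # skip classes absent from both labels and predictions
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example (hypothetical data): three classes, six points.
labels      = [0, 0, 1, 1, 2, 2]
predictions = [0, 1, 1, 1, 2, 0]
miou = mean_iou(labels, predictions, num_classes=3)
```

In this toy case the per-class IoUs are 1/3, 2/3, and 1/2, so their mean is 0.5. On a real benchmark the confusion matrix would typically be accumulated over the whole evaluation set before the per-class IoUs are averaged.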

Full Text