Abstract

Real-time three-dimensional (3D) object detection has become a crucial component of autonomous driving systems. Recent research demonstrates that voxel-based feature aggregation is accurate and efficient in large 3D scenes. However, the choice of voxel size is a sensitive parameter because of the trade-off between detection performance and inference speed. To alleviate this problem, in this paper we propose a sparse multi-scale voxel feature aggregation network (SMS-Net), a novel one-stage, end-to-end network built around a sparse multi-scale-fusion (SMSF) module and a shallow-to-deep regression (SDR) module. First, the raw point clouds are divided into voxels at different scales to construct diverse 3D sparse feature maps. The SMSF module then attentively aggregates point-wise features with a perspective-channel attention mechanism and fuses multi-scale features at the 3D sparse feature-map level to capture more fine-grained shape information. In addition, the SDR module boosts localization and 3D box estimation accuracy through multiple aggregations at the feature-map level, with little additional computational overhead. Extensive experiments demonstrate the performance improvements contributed by each module of the proposed method. On the KITTI 3D object detection benchmark, for example, SMS-Net outperforms most state-of-the-art one-stage methods, and its performance is even comparable to that of two-stage methods, while running at a real-time inference speed of 42 Hz. SMS-Net also achieves state-of-the-art performance on the nuScenes 3D detection benchmark.
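The multi-scale voxelization step described in the abstract can be illustrated with a minimal sketch. All names here are illustrative assumptions, and mean pooling stands in for the paper's attention-based aggregation, which the abstract does not specify in detail:

```python
import numpy as np

def voxelize(points, voxel_size):
    """Assign each 3D point to a voxel at the given voxel size and
    aggregate the points in each voxel by averaging (an illustrative
    stand-in for the paper's attentive aggregation in the SMSF module)."""
    idx = np.floor(points[:, :3] / voxel_size).astype(np.int64)
    # Collapse duplicate voxel indices; `inverse` maps each point to its voxel.
    keys, inverse = np.unique(idx, axis=0, return_inverse=True)
    feats = np.zeros((len(keys), points.shape[1]))
    np.add.at(feats, inverse, points)            # sum point features per voxel
    feats /= np.bincount(inverse)[:, None]       # mean over points in each voxel
    # A sparse 3D feature map: only occupied voxels are stored.
    return dict(zip(map(tuple, keys), feats))

# Build sparse feature maps at several scales, as in the multi-scale setup.
np.random.seed(0)
points = np.random.rand(1000, 3) * 10.0          # synthetic scene, 10 m cube
scales = [0.2, 0.4, 0.8]                         # illustrative voxel sizes (m)
maps = {s: voxelize(points, s) for s in scales}
```

Coarser voxel sizes yield fewer occupied voxels (faster inference, coarser shape detail), while finer sizes preserve geometry at higher cost; fusing the resulting sparse feature maps across scales is what the SMSF module addresses.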
