Multi-Source Features Fusion Single Stage 3D Object Detection With Transformer

Guofeng Tong, Hao Peng, Yaqi Wang, Zheng Li

doi:10.1109/lra.2023.3244124

Abstract

Because they extract context information efficiently, voxel-based methods are widely used for 3D object detection from point clouds. However, voxelizing the raw point cloud inevitably causes a quantization loss of geometric information, which can degrade final detection performance. To alleviate this problem, we propose a novel single-stage 3D detection algorithm named MFT-SSD for accurate 3D bounding box prediction. Unlike most single-stage methods, our framework combines point-based and voxel-based backbones and extracts point, multi-scale voxel, and BEV features as multi-source features. To strengthen the correlation among these different feature representations, we propose a transformer feature fusion module with a self-attention mechanism that integrates the multi-source features into richer point-wise features. The fused point-wise features are then sent to a candidate generation layer, which generates a series of candidate points closer to the instance centers. The final 3D bounding boxes are predicted from these candidate points. Experiments on the KITTI and nuScenes datasets show that the proposed algorithm achieves performance competitive with state-of-the-art algorithms.
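
The idea of fusing multi-source features with self-attention can be illustrated with a minimal sketch. It is not the MFT-SSD module itself, whose design the abstract does not specify. It assumes each point has three feature vectors of equal dimension (point, voxel, BEV), treats them as three tokens, applies scaled dot-product self-attention with identity query/key/value projections, and averages the attended tokens into one point-wise feature. The function names `self_attention` and `fuse_point`, the identity projections, and the mean pooling are all assumptions made for illustration.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention(tokens):
    # Scaled dot-product self-attention.
    # Assumption: Q, K and V are the tokens themselves (identity projections);
    # a trained module would use learned projection matrices instead.
    d = len(tokens[0])
    attended = []
    for q in tokens:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in tokens])
        attended.append(
            [sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)]
        )
    return attended


def fuse_point(point_feat, voxel_feat, bev_feat):
    # Hypothetical fusion step: attend across the three sources,
    # then mean-pool the attended tokens into one point-wise feature.
    attended = self_attention([point_feat, voxel_feat, bev_feat])
    d = len(point_feat)
    return [sum(t[i] for t in attended) / len(attended) for i in range(d)]


# Toy features for a single point (values are made up for illustration).
fused = fuse_point(
    [1.0, 0.0, 0.0, 0.0],  # point-based feature
    [0.0, 1.0, 0.0, 0.0],  # voxel feature
    [0.0, 0.0, 1.0, 0.0],  # BEV feature
)
```

In this sketch every source attends to every other source, so the fused vector mixes information from all three representations rather than simply concatenating them.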

Full Text