Pillar-based methods play an important role in LiDAR-based 3D object detection: they can directly exploit efficient 2D backbones and save computational resources during inference. Existing methods usually project the original point cloud into either the cylindrical view or the bird's-eye view for feature extraction. However, the former suffers from occlusion problems and from instance scales that vary greatly with distance, while the latter leads to considerable confusion due to the loss of semantic information caused by the sparsity of the projected point cloud. In this paper, we present a novel and efficient two-stage point-pillar hybrid architecture named Attentive Multi-View Fusion Network (AMVFNet), which extracts features from the cylindrical view, the bird's-eye view, and the raw point cloud. Rather than designing more complex modules to patch the problems inherent in any single-view approach, our multi-view fusion architecture combines the strengths of multiple perspectives to improve performance at a more fundamental level. In addition, to compensate for the quantization distortion caused by the projection operations, we propose attentive feature enhancement layers that further improve the capture of contextual information. Extensive experiments on the KITTI detection benchmark show that the proposed AMVFNet achieves competitive performance compared with other state-of-the-art 3D object detectors.
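To illustrate the general idea of attention-based multi-view fusion described above, the following is a minimal PyTorch sketch. It assumes per-point features have already been gathered from the bird's-eye view, the cylindrical view, and the raw points, and uses an SE-style channel gate as a stand-in for the paper's attentive feature enhancement layers; all module and variable names are illustrative assumptions, not the authors' actual AMVFNet implementation.

```python
# Minimal sketch of attention-based multi-view feature fusion (hypothetical,
# not the authors' released code).
import torch
import torch.nn as nn


class AttentiveFusion(nn.Module):
    """Fuse per-point features from the bird's-eye view, the cylindrical
    view, and the raw points, then re-weight channels with a simple
    SE-style attention gate."""

    def __init__(self, bev_dim=64, cyl_dim=64, point_dim=32, out_dim=128):
        super().__init__()
        in_dim = bev_dim + cyl_dim + point_dim
        self.proj = nn.Linear(in_dim, out_dim)
        # Channel attention: squeeze over points, excite per channel.
        self.gate = nn.Sequential(
            nn.Linear(out_dim, out_dim // 4),
            nn.ReLU(inplace=True),
            nn.Linear(out_dim // 4, out_dim),
            nn.Sigmoid(),
        )

    def forward(self, bev_feat, cyl_feat, point_feat):
        # Each input: (N_points, C_view) features already projected back to
        # the points, so the three views are aligned point-wise.
        fused = self.proj(torch.cat([bev_feat, cyl_feat, point_feat], dim=-1))
        weights = self.gate(fused.mean(dim=0, keepdim=True))  # (1, out_dim)
        return fused * weights  # channel-re-weighted multi-view features


# Toy usage with random per-point features from each view.
n = 1024
fusion = AttentiveFusion()
out = fusion(torch.randn(n, 64), torch.randn(n, 64), torch.randn(n, 32))
print(out.shape)  # torch.Size([1024, 128])
```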