Abstract

In recent 3D object detection methods for point clouds, combining point-based and voxel-based representations has become a trend. Point-based methods retain accurate position and pose information from the raw points, while voxel-based methods obtain multi-scale structural information through a 3D backbone. However, because of the sparsity and irregularity of point clouds, both representations ignore context information, which is important for detecting sparse and small objects. To solve this problem, we propose a multi-stream feature aggregation network that extracts features from three representations of the point cloud for object detection. Specifically, we extract multi-stream features from the point, voxel, and perspective-view (PV) representations in parallel, so that complementary information between the different views can enrich the feature representations; the perspective view in particular contains rich semantic context. Second, to eliminate redundant information and better exploit the correlation between the feature representations, we design an attention-based multi-stream feature fusion (MSFF) module to combine the three information streams. In addition, we introduce a new voxel RoI pooling with self-attention in the second refinement stage, which further strengthens the connections between local features within a proposal to obtain accurate classification and localization predictions. Our method achieves promising results on the KITTI dataset, especially in the cyclist category, where it improves over the baseline by 5.56%, 4.73%, and 5.16% AP on the test set at the easy, moderate, and hard difficulty levels, respectively. Code will be available at https://github.com/june2678/MRF.
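The abstract does not specify the internal design of the MSFF module, so the following is only a minimal, hypothetical PyTorch sketch of one way an attention-based fusion of three aligned feature streams could look. All names here (`AttentionFusion`, `weight_net`) and the per-stream softmax weighting scheme are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class AttentionFusion(nn.Module):
    """Hypothetical attention-based fusion of three feature streams.

    Assumes the point, voxel, and perspective-view features have already
    been gathered to a common set of N locations with C channels each.
    """

    def __init__(self, channels: int):
        super().__init__()
        # Predict one scalar attention weight per stream from the
        # concatenated features of all three streams.
        self.weight_net = nn.Sequential(
            nn.Linear(3 * channels, channels),
            nn.ReLU(inplace=True),
            nn.Linear(channels, 3),
        )

    def forward(self, point_feat, voxel_feat, pv_feat):
        # Each input: (N, C) features aligned across representations.
        stacked = torch.stack([point_feat, voxel_feat, pv_feat], dim=1)  # (N, 3, C)
        weights = torch.softmax(self.weight_net(stacked.flatten(1)), dim=-1)  # (N, 3)
        # Weighted sum downweights redundant streams while keeping
        # complementary information from the others.
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)  # (N, C)


if __name__ == "__main__":
    n, c = 1024, 64
    fusion = AttentionFusion(c)
    out = fusion(torch.randn(n, c), torch.randn(n, c), torch.randn(n, c))
    print(out.shape)  # torch.Size([1024, 64])
```

A per-location softmax over stream weights is just one plausible reading of "attention-based fusion"; channel-wise or transformer-style cross-attention variants would follow the same interface.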
