Accurate recognition and localization of 3D objects is a fundamental problem in 3D computer vision. Benefiting from transformation-free point cloud processing and flexible receptive fields, point-based methods have achieved accurate 3D point cloud modeling, yet they still fall behind voxel-based competitors in 3D detection. We observe that the set abstraction module, commonly used by point-based methods to downsample points, tends to retain excessive irrelevant background information, which hinders the learning of features that are effective for object detection. To address this issue, we propose MSSA, a Multi-representation Semantics-augmented Set Abstraction for 3D object detection. Specifically, we first design a backbone network that encodes multiple representations of the point cloud: it extracts point-wise features with PointNet to preserve fine-grained geometric structure, and adopts VoxelNet to extract voxel and BEV features that enrich the semantics of keypoints. Second, to efficiently combine the different representation features of keypoints, we propose a Point feature-guided Voxel and BEV feature fusion (PVB-Fusion) module that adaptively fuses the multi-representation features while suppressing noise. Finally, a novel Multi-representation Semantic-guided Farthest Point Sampling (MS-FPS) algorithm is designed to guide the set abstraction modules in progressively downsampling the point cloud, retaining more informative foreground points and thereby improving instance recall and detection performance. We evaluate MSSA on the widely used KITTI dataset and the more challenging nuScenes dataset. Experimental results show that, compared with PointRCNN, our method improves the moderate-level AP of the three object classes by 7.02%, 6.76%, and 5.44%, respectively. Compared with the advanced point-voxel-based method PV-RCNN, our method improves the moderate-level AP by 1.23%, 2.84%, and 0.55% for the three classes, respectively.
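The sampling idea behind MS-FPS can be illustrated with a minimal sketch. Assuming it follows the common semantics-weighted farthest point sampling formulation (the score weighting, function name, and interface below are illustrative assumptions, not the paper's exact algorithm), each point's distance to the already-selected set is scaled by its predicted foreground score, so likely-foreground points are preferentially retained during downsampling.

```python
import numpy as np

def semantic_guided_fps(points, scores, num_samples):
    """Farthest point sampling weighted by semantic foreground scores (illustrative sketch).

    points: (N, 3) array of xyz coordinates
    scores: (N,) array of foreground probabilities in [0, 1]
    num_samples: number of points to keep
    Returns indices of the sampled points.
    """
    n = points.shape[0]
    selected = np.zeros(num_samples, dtype=np.int64)
    min_dist = np.full(n, np.inf)
    # Start from the point with the highest semantic score.
    selected[0] = int(np.argmax(scores))
    for i in range(1, num_samples):
        # Update each point's squared distance to the selected set.
        diff = points - points[selected[i - 1]]
        dist = np.sum(diff * diff, axis=1)
        min_dist = np.minimum(min_dist, dist)
        # Weight geometric distance by the semantic score before picking
        # the next farthest point, biasing sampling toward foreground.
        selected[i] = int(np.argmax(min_dist * scores))
    return selected
```

As a usage example, `semantic_guided_fps(xyz, foreground_prob, 4096)` would downsample a scene to 4096 points while favoring high-score foreground regions over background clutter.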