In this paper, we propose a novel 3D object detection model that combines the strengths of the Voxel Transformer (VoTr) and the Confident IoU-Aware Single-Stage Object Detector (CIA-SSD) to address the challenges of detecting objects in 3D point clouds. Our model adopts VoTr as its backbone, enabling long-range interactions between voxels through a self-attention mechanism. This overcomes a key limitation of conventional voxel-based 3D detectors, whose restricted receptive fields prevent them from capturing sufficient contextual information. The backbone integrates sparse and submanifold voxel modules, which operate efficiently on empty and non-empty voxel positions respectively, handling both the natural sparsity of point clouds and the large number of non-empty voxels. Moreover, inspired by the CIA-SSD design, our model incorporates the Spatial-Semantic Feature Aggregation (SSFA) module, which adaptively fuses high-level abstract semantic features with low-level spatial features, ensuring accurate bounding-box regression and classification confidence. Furthermore, building on the IoU-aware confidence rectification module, which aligns confidence scores with localization accuracy, we devise an optimized Region Proposal Network (RPN) detection head that serves as a dense head, additionally predicting box IoU under an IoU loss to further improve accuracy. By combining these two state-of-the-art techniques, our approach provides a precise and efficient solution for 3D object detection in point clouds. We evaluate our model on the KITTI dataset and achieve 76.56% AP3D on the hard difficulty level.
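For illustration, the following is a minimal PyTorch sketch of the IoU-aware confidence rectification idea referenced above: the classification score of each candidate box is modulated by its predicted IoU, so that well-localized boxes rank higher when boxes are sorted (e.g., before NMS). The multiplicative form `c * iou**beta` and the value of `beta` are assumptions made here for exposition, not the exact function used by CIA-SSD or by our detection head.

```python
import torch

def rectify_confidence(cls_score: torch.Tensor,
                       pred_iou: torch.Tensor,
                       beta: float = 4.0) -> torch.Tensor:
    """Rectify classification confidence with a predicted box IoU.

    Sketch only: each box's classification score is scaled by its
    predicted IoU raised to a power beta, suppressing boxes whose
    confidence is high but whose localization is poor. The form of
    the rectification and the beta value are illustrative assumptions.
    """
    iou = pred_iou.clamp(0.0, 1.0)       # IoU branch output kept in [0, 1]
    return cls_score * iou.pow(beta)     # demote poorly localized boxes

# Example: three candidate boxes with classification scores and predicted IoUs
scores = torch.tensor([0.90, 0.80, 0.70])
ious = torch.tensor([0.95, 0.40, 0.85])
print(rectify_confidence(scores, ious))  # the 0.80-score, low-IoU box is demoted
```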