Abstract
Recent works on 3D object detection take the range image as input and have achieved performance comparable to bird's eye view (BEV) based methods. Compared to BEV, the range view provides dense and compact observations, which allows the use of more popular feature encoders. To leverage the complementary information of range view and BEV, we present ACDet, a novel single-stage multi-view fusion method. Rather than fusing point-level features from range view and BEV at an early stage, our key contribution is an attentive cross-view fusion module based on a transformer that fuses higher-level features; we further adopt a supervised foreground mask learned from BEV features to enhance the fused features. Notably, a geometric-attention kernel is proposed to enhance the features extracted from the range image. Finally, we design an anchor-free detection head with an optimized label assignment strategy, and its performance exceeds the existing anchor-based and anchor-free 3D detection heads by a large margin. We evaluate our ACDet model extensively on the KITTI dataset and the Waymo Open Dataset (WOD). ACDet outperforms most single-stage models on the KITTI dataset in terms of multi-class 3D and BEV mean average precision. ACDet also outperforms both range-view and multi-view fusion methods on WOD.
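To make the idea of attentive cross-view fusion concrete, the following is a minimal sketch of generic scaled dot-product cross-attention in which BEV features act as queries and range-view features as keys and values. It is not the ACDet module itself: the function name `cross_view_fusion`, the projection matrices `w_q`, `w_k`, `w_v`, the concatenation step, and the use of the foreground mask as a simple multiplicative gate are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_view_fusion(bev_feats, rv_feats, fg_mask, w_q, w_k, w_v):
    """Hypothetical cross-view fusion via single-head cross-attention.

    bev_feats: (N, C) features of N BEV cells (queries).
    rv_feats:  (M, C) features of M range-view pixels (keys and values).
    fg_mask:   (N,) foreground scores in [0, 1] for the BEV cells.
    w_q, w_k, w_v: (C, D) projection matrices (assumed, not from the paper).
    """
    q = bev_feats @ w_q                      # (N, D)
    k = rv_feats @ w_k                       # (M, D)
    v = rv_feats @ w_v                       # (M, D)
    # Each BEV cell attends over all range-view pixels.
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (N, M)
    attended = attn @ v                      # (N, D)
    # Concatenate the original BEV features with the attended range-view features.
    fused = np.concatenate([bev_feats, attended], axis=-1)   # (N, C + D)
    # Assumption: the foreground mask gates the fused features multiplicatively.
    return fused * fg_mask[:, None], attn


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    bev = rng.normal(size=(6, 8))
    rv = rng.normal(size=(10, 8))
    w_q, w_k, w_v = (rng.normal(size=(8, 4)) for _ in range(3))
    mask = np.array([1, 0, 1, 1, 0, 1], dtype=float)
    fused, attn = cross_view_fusion(bev, rv, mask, w_q, w_k, w_v)
    print(fused.shape, attn.shape)
```

In this sketch a hard 0/1 mask zeroes out background cells entirely; a learned soft mask would instead scale them down, and a real model would typically use multi-head attention with learned projections.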