Abstract

3D object detection in point clouds aims to simultaneously localize and recognize 3D objects from a 3D point set. However, since point clouds are usually sparse, unordered, and irregular, it is challenging to learn robust point representations and to sample high-quality object queries. To address these issues, we propose a Long-short rangE Adaptive transformer with Dynamic sampling (LeadNet), which integrates a point representation encoder, a dynamic object query sampling decoder, and an object detection decoder in a unified architecture for 3D object detection. Specifically, in the point representation encoder, we combine an attention layer with a channel attentive kernel convolution layer to capture the local structure and the long-range context simultaneously. In the dynamic object query sampling decoder, we utilize multiple dynamic prototypes to adapt to various point clouds. In the object detection decoder, we incorporate a dynamic Gaussian weight map into the cross-attention mechanism so that the decoder focuses on the visual regions near each object, which further accelerates training. Extensive experiments on two standard benchmarks show that LeadNet outperforms the 3DETR baseline by 11.6% mAP50 on ScanNet v2 and achieves new state-of-the-art results among geometric-only approaches on the ScanNet v2 and SUN RGB-D benchmarks.
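
The abstract does not specify how the Gaussian weight map enters the cross-attention computation. The snippet below is a minimal sketch of one plausible realization, assuming the map is derived from each query's predicted object center and combined with the attention logits in log space; the function name, the `sigma` bandwidth parameter, and all tensor names are hypothetical, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def gaussian_weighted_cross_attention(queries, keys, values,
                                      centers, key_xyz, sigma=0.3):
    """Cross-attention biased by a Gaussian weight map (illustrative sketch).

    queries:  (B, Q, D) object query features
    keys:     (B, N, D) encoded point features
    values:   (B, N, D) encoded point features
    centers:  (B, Q, 3) predicted object center per query (assumed input)
    key_xyz:  (B, N, 3) 3D coordinates of the point tokens
    sigma:    spread of the Gaussian around each predicted center
    """
    d = queries.size(-1)
    # standard scaled dot-product attention logits: (B, Q, N)
    attn_logits = queries @ keys.transpose(1, 2) / d ** 0.5
    # squared distance from each query's center to every point token
    dist2 = ((centers.unsqueeze(2) - key_xyz.unsqueeze(1)) ** 2).sum(-1)
    # Gaussian map peaked near the predicted center, flat far away
    gauss = torch.exp(-dist2 / (2 * sigma ** 2))
    # add in log space so the map rescales attention multiplicatively
    attn = F.softmax(attn_logits + torch.log(gauss + 1e-6), dim=-1)
    return attn @ values  # (B, Q, D) attended features per query


# usage with random tensors, shapes chosen arbitrarily for illustration
B, Q, N, D = 2, 16, 1024, 256
out = gaussian_weighted_cross_attention(
    torch.randn(B, Q, D), torch.randn(B, N, D), torch.randn(B, N, D),
    torch.randn(B, Q, 3), torch.randn(B, N, 3))
```

Under this reading, the Gaussian bias suppresses attention to points far from a query's current center estimate, which is one way such a prior could steer the decoder toward the proper regions early in training.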
