Abstract

Object detection in 3D scenes relies on two main approaches, mirroring 2D object detection: proposal-based detection (two-stage detectors) and anchor-based detection (single-stage detectors). In this paper, we propose the 3DeTR framework, which produces 3D detections without anchors or proposals, allowing the entire neural network to be trained end-to-end. Raw point cloud scenes are augmented and fed into a distance-and-reflectiveness-based feature extractor to produce representative points. A transformer encoder–decoder module then learns local object relations and global context to generate detections in parallel, which are passed to a set-based loss function that uniquely maps predictions to the set of ground-truth labels. The architecture produces 3D detections by regressing directly against the set of ground truths, without the anchors or proposals that bottleneck object detection performance. We tested the framework on the KITTI Vision Benchmark Suite 3D object detection dataset, achieving results on par with the state of the art: 80.37 AP on the Car (Moderate) class and 47.92 AP on the Pedestrian (Moderate) class.
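The unique prediction-to-ground-truth mapping described above is characteristic of DETR-style set losses, where a bipartite (Hungarian) matching is computed before any loss term. The following is a minimal, hypothetical Python sketch of such a matching step; the function name, cost weights, and 7-parameter box encoding are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of a DETR-style set-based matching step: each ground-truth
# box is assigned to exactly one prediction via bipartite (Hungarian) matching.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_predictions(pred_boxes, pred_class_probs, gt_boxes, gt_labels,
                      w_cls=1.0, w_box=5.0):
    """Return (pred_idx, gt_idx) index arrays forming a one-to-one assignment.

    pred_boxes:       (N, 7) predicted 3D boxes (x, y, z, l, w, h, yaw)
    pred_class_probs: (N, C) softmax class probabilities
    gt_boxes:         (M, 7) ground-truth boxes
    gt_labels:        (M,)   ground-truth class indices
    """
    # Classification cost: negative probability of the true class, per pair.
    cost_cls = -pred_class_probs[:, gt_labels]                     # (N, M)
    # Box regression cost: L1 distance between box parameters, per pair.
    cost_box = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)
    cost = w_cls * cost_cls + w_box * cost_box                     # (N, M)
    # Hungarian algorithm yields a unique prediction for each ground truth;
    # unmatched predictions would be trained toward a "no object" class.
    pred_idx, gt_idx = linear_sum_assignment(cost)
    return pred_idx, gt_idx

# Toy usage: 4 parallel predictions matched against 2 ground-truth boxes.
rng = np.random.default_rng(0)
preds = rng.random((4, 7))
probs = rng.dirichlet(np.ones(3), size=4)   # 3 classes
gts = rng.random((2, 7))
labels = np.array([0, 2])
print(match_predictions(preds, probs, gts, labels))
```

Because the matching is one-to-one, no duplicate-suppression step such as NMS is needed, which is what allows the network to be trained end-to-end on the set loss alone.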
