Abstract

To accurately obtain 3D information, the correct use of depth data is crucial. Compared with radar-based methods, detecting objects in 3D space from a single image is extremely challenging due to the lack of depth cues; however, monocular 3D object detection offers a more economical solution. Traditional monocular 3D object detection methods often rely on geometric constraints, such as key points, object shape relationships, and 3D-to-2D optimization, to address the inherent lack of depth information. However, these methods still struggle to extract rich information directly from depth estimation for fusion. To fundamentally enhance monocular 3D object detection, we propose a monocular 3D object detection network based on depth information enhancement. The network learns the object detection and depth estimation tasks simultaneously through a unified framework, integrates depth features into the detection branch as auxiliary information, and then constrains and enhances these features to obtain a better spatial representation. To this end, we introduce a new cross-modal fusion strategy, which achieves a more reasonable fusion of cross-modal information by exploring the redundant and complementary information in RGB and depth features, as well as the interactions between them. Extensive experiments on the KITTI dataset show that our method significantly improves the performance of monocular 3D object detection.
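The abstract does not specify the exact form of the cross-modal fusion strategy, but the idea of weighing redundant against complementary information between RGB and depth features can be illustrated with a simple gated fusion. The sketch below is a hypothetical, minimal version: `cross_modal_fusion`, its gate weights `w_gate`, and the use of plain NumPy arrays in place of learned convolutional features are all assumptions for illustration, not the paper's actual architecture.

```python
import numpy as np

def sigmoid(x):
    """Elementwise logistic function, mapping gate logits to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def cross_modal_fusion(rgb_feat, depth_feat, w_gate):
    """Gated fusion of RGB and depth feature vectors (illustrative only).

    rgb_feat, depth_feat: arrays of shape (N, C) with matching channels.
    w_gate: learned projection of shape (2C, C) that maps the concatenated
            modalities to per-channel gate logits.
    """
    # The gate is computed from both modalities jointly, so it can
    # down-weight redundant channels and pass complementary ones through.
    joint = np.concatenate([rgb_feat, depth_feat], axis=-1)   # (N, 2C)
    gate = sigmoid(joint @ w_gate)                            # (N, C)
    # Convex per-channel blend: gate near 1 trusts RGB, near 0 trusts depth.
    fused = gate * rgb_feat + (1.0 - gate) * depth_feat
    return fused
```

In a full detector these features would come from the detection and depth-estimation branches of the shared backbone, and `w_gate` would be trained end-to-end; with zero-initialized gate weights the blend starts as a plain average of the two modalities.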
