Abstract
In this paper, we propose MSL3D, a novel deep architecture that combines multiple sensors for 3D object detection. While recent LiDAR-camera methods introduce additional semantic cues and produce fewer false detections, a performance gap remains compared with LiDAR-only methods. We argue that this gap has two causes: 1) the 3D spherical receptive fields of point cloud set abstraction are not aligned with the 2D pixel-level receptive fields of the image; 2) the premature introduction of image information makes it difficult to apply data augmentation to LiDAR and image data synchronously. For the first problem, we extend 3D set abstraction to a 2D set abstraction that transforms 2D image features into the 3D sphere, unifying the receptive fields of the multi-modal data. For the second problem, we design a novel two-stage 3D detection framework that employs a LiDAR-only backbone in the first stage to generate high-recall, high-quality proposals, and then integrates image and point cloud information for box refinement and confidence prediction. Besides, we add two auxiliary networks to effectively learn image features and point cloud features when different multi-modal data augmentation strategies are applied synchronously. Moreover, we design a consistency-structure generator that uses stereo images to determine whether a point in 3D space belongs to the contour of an object, thereby supplementing the sparse point cloud information. Extensive experiments on the popular KITTI 3D object detection dataset show that our proposed MSL3D achieves better performance than other LiDAR-only and LiDAR-camera fusion approaches.
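To make the receptive-field alignment idea concrete, the sketch below (not from the paper; all function names are hypothetical, and a simple pinhole camera model is assumed) illustrates one way a 2D set abstraction could pool image features inside the projection of a keypoint's 3D spherical neighborhood, so that the 2D receptive field matches the 3D one:

```python
# Minimal sketch of aligning 2D image receptive fields with 3D spherical ones.
# Hypothetical helper names; the paper's actual layer definitions may differ.
import torch

def project_to_image(points_3d, K):
    """Pinhole projection of (N, 3) camera-frame points with intrinsics K (3x3)."""
    uvw = points_3d @ K.T                              # (N, 3)
    return uvw[:, :2] / uvw[:, 2:3], points_3d[:, 2]   # pixels (N, 2), depth (N,)

def set_abstraction_2d(keypoints, radius_3d, img_feat, K):
    """For each 3D keypoint, max-pool image features inside the projected sphere.

    keypoints: (N, 3) in camera frame; img_feat: (C, H, W); K: (3, 3) intrinsics.
    Returns (N, C) image features whose 2D support matches the 3D spherical RF.
    """
    C, H, W = img_feat.shape
    uv, depth = project_to_image(keypoints, K)
    # Projected pixel radius of a 3D sphere: r_px ~= f * r / z (approximation).
    focal = K[0, 0]
    radius_px = focal * radius_3d / depth.clamp(min=1e-3)  # (N,)

    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).float()           # (H, W, 2) pixel coords

    pooled = []
    for i in range(keypoints.shape[0]):
        mask = (grid - uv[i]).norm(dim=-1) <= radius_px[i]  # pixels inside circle
        if mask.any():
            pooled.append(img_feat[:, mask].max(dim=1).values)
        else:
            pooled.append(img_feat.new_zeros(C))            # sphere projects off-image
    return torch.stack(pooled)                              # (N, C)

# Toy usage: 4 keypoints, a 16-channel 64x64 feature map, simple intrinsics.
K = torch.tensor([[200., 0., 32.], [0., 200., 32.], [0., 0., 1.]])
pts = torch.tensor([[0.5, 0.2, 8.0], [-1.0, 0.0, 12.0],
                    [0.0, 0.5, 6.0], [2.0, -0.5, 15.0]])
feat = torch.randn(16, 64, 64)
print(set_abstraction_2d(pts, radius_3d=0.8, img_feat=feat, K=K).shape)  # (4, 16)
```

The key design point is that the pooling region in the image shrinks with depth (r_px ≈ f·r/z), so near and far keypoints aggregate image evidence over regions corresponding to the same physical extent as their 3D spherical neighborhoods.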