Abstract

Object detection in 3D space is a fundamental technology in autonomous driving systems. Among published 3D object detection methods, single-modal methods based on point clouds have been widely studied. One problem exposed by these methods is that point clouds lack color and texture features; this limitation in conveying semantic information often leads to detection failures. In contrast, multi-modal methods based on the fusion of images and point clouds may solve this problem, but relevant research remains insufficient. In this work, a single-stage multi-view multi-modal 3D object detector (MVMM) is proposed, which naturally and efficiently extracts semantic and geometric information from the image and the point cloud. Specifically, a data-level fusion approach, point cloud coloring, is used to combine information from the camera and the LiDAR. Next, an encoder-decoder backbone is devised to extract features from the colored points in the range view. Then, the colored points are concatenated with the range view features, voxelized, and fed into a point view bridge for down-sampling. Finally, the down-sampled feature map is consumed by a bird's eye view backbone and a detection head that generates 3D detections from predefined anchors. In extensive experiments on the KITTI dataset, MVMM achieves competitive performance while running at 27 FPS on a 1080 Ti GPU. In particular, MVMM performs extremely well in difficult scenes (e.g., heavy occlusion and truncation) owing to its understanding of the fused information.
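As an illustration of the data-level fusion step described in the abstract, the sketch below projects LiDAR points into the camera image and attaches per-pixel RGB values to each visible point, following the standard KITTI calibration pipeline. This is a minimal sketch of the general point cloud coloring idea, not the authors' implementation; the function name `color_point_cloud` and its signature are assumptions, and the calibration matrices are assumed to be given as a 3x4 projection matrix `P2` and 4x4 homogeneous `R0_rect` and `Tr_velo_to_cam` matrices.

```python
import numpy as np

def color_point_cloud(points, image, P2, R0_rect, Tr_velo_to_cam):
    """Attach per-point RGB by projecting LiDAR points into the camera image.

    points          : (N, 3) LiDAR coordinates (x, y, z).
    image           : (H, W, 3) uint8 camera image.
    P2              : (3, 4) camera projection matrix (KITTI convention, assumed).
    R0_rect         : (4, 4) rectification matrix, padded to homogeneous form (assumed).
    Tr_velo_to_cam  : (4, 4) LiDAR-to-camera transform, padded to homogeneous form (assumed).
    Returns an (M, 6) array of [x, y, z, r, g, b] for points visible in the image.
    """
    n = points.shape[0]
    pts_h = np.hstack([points, np.ones((n, 1))])        # homogeneous LiDAR coordinates, (N, 4)
    cam = R0_rect @ Tr_velo_to_cam @ pts_h.T            # rectified camera frame, (4, N)
    img_pts = P2 @ cam                                   # image-plane projection, (3, N)

    depth = img_pts[2]
    u = img_pts[0] / depth                               # pixel column
    v = img_pts[1] / depth                               # pixel row

    h, w = image.shape[:2]
    mask = (depth > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)

    rgb = image[v[mask].astype(int), u[mask].astype(int)] / 255.0
    return np.hstack([points[mask], rgb])               # colored points, (M, 6)
```

The resulting colored points would then serve as the shared input to the range view encoder-decoder and the subsequent voxelization described in the abstract.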
