Abstract
Many LiDAR-camera-based 3D object detectors have been developed with two heavy neural networks that extract view-specific features, while a LiDAR-camera-based 3D detector with only one neural network has not been implemented. To tackle this issue, this paper first presents an early-fusion method that exploits both LiDAR and camera data for fast 3D object detection with only one backbone, achieving a good balance between accuracy and efficiency. We propose a novel point feature fusion module that extracts point-wise features directly from raw RGB images, without an image backbone, and fuses them with the corresponding point cloud. In this paradigm, the backbone that extracts RGB image features is dropped to reduce the large computation cost. Our method first voxelizes a point cloud into a 3D voxel grid and uses two strategies to reduce information loss during voxelization. The first is to use a small voxel size of (0.05 m, 0.05 m, 0.1 m) along the X-, Y-, and Z-axes, respectively; the second is to project the features of the point cloud (e.g., intensity or height information) onto RGB images. Extensive experiments on the KITTI benchmark suite show that the proposed approach outperforms state-of-the-art LiDAR-camera-based methods on the three classes in 3D performance (Easy, Moderate, Hard): cars (88.04%, 77.60%, 76.23%), pedestrians (66.65%, 60.49%, 54.51%), and cyclists (75.87%, 60.07%, 54.51%). Additionally, the proposed model runs at 17.8 frames per second (FPS), which is almost 2× faster than state-of-the-art fusion methods for LiDAR and camera.
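The voxelization step described in the abstract can be illustrated with a minimal sketch. Only the voxel size (0.05 m, 0.05 m, 0.1 m) comes from the abstract; the crop range `PC_RANGE`, the function names, and the mean-pooling reduction are assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import defaultdict

# Voxel size along X, Y, Z in metres, as quoted in the abstract.
VOXEL_SIZE = (0.05, 0.05, 0.1)
# Assumed crop range (x_min, y_min, z_min, x_max, y_max, z_max); not from the paper.
PC_RANGE = (0.0, -40.0, -3.0, 70.4, 40.0, 1.0)


def voxelize(points, voxel_size=VOXEL_SIZE, pc_range=PC_RANGE):
    """Group (x, y, z, intensity) points by the integer index of their voxel."""
    voxels = defaultdict(list)
    for p in points:
        xyz = p[:3]
        # Skip points that fall outside the crop range.
        if any(not (pc_range[i] <= xyz[i] < pc_range[i + 3]) for i in range(3)):
            continue
        idx = tuple(
            int(math.floor((xyz[i] - pc_range[i]) / voxel_size[i]))
            for i in range(3)
        )
        voxels[idx].append(tuple(p))
    return dict(voxels)


def voxel_mean(voxels):
    """Reduce each voxel to the mean of its points (one simple feature choice)."""
    return {
        idx: tuple(sum(channel) / len(pts) for channel in zip(*pts))
        for idx, pts in voxels.items()
    }


points = [
    (10.02, 0.01, -0.95, 0.5),   # these two points share one voxel
    (10.03, 0.02, -0.93, 0.7),
    (30.02, 5.01, 0.05, 0.1),    # a second, distant voxel
    (100.0, 0.0, 0.0, 0.2),      # outside the assumed range, dropped
]
voxels = voxelize(points)
features = voxel_mean(voxels)
```

A smaller voxel keeps more of the original geometry per cell at the cost of a larger grid, which is the trade-off the first strategy in the abstract accepts.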
Highlights
With the rapid development of autonomous vehicles, three-dimensional (3D) object detection, whose purpose is to perceive the size and accurate location of objects in the real world, has become more important
This paper proposes a highly efficient pointwise feature fusion module, which directly extracts the RGB image point feature based on a point cloud and fuses the extracted RGB image point feature with the corresponding feature of the point cloud
The related work section starts by reviewing recent works that apply convolutional neural networks (CNNs) to LiDAR-based 3D object detection, then focuses on methods specific to multi-modal 3D object detection from point clouds and RGB images
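The pointwise fusion idea in the highlights (reading an image feature at each point's location and attaching it to that point) can be sketched as follows. This is an illustrative approximation, not the paper's module: the nearest-pixel lookup, the function names, and the single 3x4 projection matrix `proj` (assumed to already combine the camera intrinsics and the LiDAR-to-camera transform) are all assumptions.

```python
def project_point(xyz, proj):
    """Project a 3D point to pixel coordinates (u, v) with a 3x4 matrix.

    Returns None when the point lies behind the camera.
    """
    x, y, z = xyz
    hom = (x, y, z, 1.0)
    u, v, w = (sum(row[k] * hom[k] for k in range(4)) for row in proj)
    if w <= 0:
        return None
    return u / w, v / w


def fuse_point_features(points, image, proj):
    """Append the RGB value at each point's projected pixel to its features."""
    n_rows, n_cols = len(image), len(image[0])
    fused = []
    for p in points:
        uv = project_point(p[:3], proj)
        if uv is None:
            continue
        col, row = int(round(uv[0])), int(round(uv[1]))
        # Keep only points that land inside the image.
        if 0 <= row < n_rows and 0 <= col < n_cols:
            r, g, b = image[row][col]
            fused.append(tuple(p) + (r / 255.0, g / 255.0, b / 255.0))
    return fused


# Toy pinhole camera (focal length 2, principal point (1, 1)) and a 3x3 image.
proj = [
    [2.0, 0.0, 1.0, 0.0],
    [0.0, 2.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
image = [[(0, 0, 0)] * 3 for _ in range(3)]
image[1][1] = (255, 0, 0)
image[1][2] = (0, 255, 0)

points = [
    (0.0, 0.0, 5.0, 0.4),    # projects to pixel (u=1, v=1)
    (1.0, 0.0, 2.0, 0.8),    # projects to pixel (u=2, v=1)
    (0.0, 0.0, -1.0, 0.1),   # behind the camera, skipped
    (10.0, 0.0, 1.0, 0.2),   # outside the image, skipped
]
fused = fuse_point_features(points, image, proj)
```

Because the lookup reads raw pixels, no image backbone is involved; any learned processing would happen after the fused per-point features enter the shared network.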
Summary
With the rapid development of autonomous vehicles, three-dimensional (3D) object detection, whose purpose is to perceive the size and accurate location of objects in the real world, has become more important. LiDAR is employed to collect the surrounding 3D data, referred to as a point cloud, and the camera is used to capture a high-resolution RGB image. Extracting and fusing the features of the point cloud and the RGB image quickly and efficiently is non-trivial. Before the advent of highly efficient graphics processing units (GPUs), representative studies [5]–[10] converted point clouds into 2D dense images or structured voxel-grid representations and used 2D neural networks to extract the corresponding features from the converted 2D image.
Jo: Fast and Accurate 3D Object Detection for Lidar-Camera-Based Autonomous Vehicles
This paper first presents an early-fusion method to exploit both LiDAR and camera data for fast 3D object detection with only one backbone, achieving a good balance between accuracy and efficiency. This paper enhances 3D object detection with an RGB+ image, which preserves the information projected from its corresponding point cloud. The presented one-stage 3D multi-class object detection framework outperforms state-of-the-art LiDAR-camera-based methods on the KITTI benchmark [18] in terms of both speed and accuracy.
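One way to picture an RGB+ image is as the RGB image with extra channels that hold LiDAR information at the pixels where points land. The sketch below is a hedged illustration only: the two extra channels (intensity and height), the zero fill for pixels without points, the nearest-point rule for collisions, and the function name `build_rgb_plus` are assumptions, and the input points are assumed to be already projected into the image plane.

```python
def build_rgb_plus(image, projected_points):
    """Return an H x W grid of [r, g, b, intensity, height] pixels.

    projected_points: iterable of (u, v, depth, intensity, z) tuples,
    i.e. LiDAR points already projected into the image (assumed input).
    """
    n_rows, n_cols = len(image), len(image[0])
    # Start from the RGB channels plus two empty LiDAR channels.
    rgb_plus = [[list(px) + [0.0, 0.0] for px in row] for row in image]
    nearest = {}  # (row, col) -> depth of the closest point written so far
    for u, v, depth, intensity, z in projected_points:
        col, row = int(round(u)), int(round(v))
        if depth <= 0 or not (0 <= row < n_rows and 0 <= col < n_cols):
            continue
        # When several points hit one pixel, keep the closest one.
        if depth < nearest.get((row, col), float("inf")):
            nearest[(row, col)] = depth
            rgb_plus[row][col][3:] = [intensity, z]
    return rgb_plus


image = [[(10, 20, 30)] * 2 for _ in range(2)]
projected = [
    (0, 0, 5.0, 0.3, -1.2),   # farther point at pixel (0, 0)
    (0, 0, 2.0, 0.9, -0.5),   # closer point at the same pixel, wins
    (1, 1, 4.0, 0.1, 0.2),
    (5, 5, 1.0, 1.0, 1.0),    # outside the image, ignored
]
rgb_plus = build_rgb_plus(image, projected)
```

Pixels that receive no LiDAR point keep zeros in the extra channels, so the result is dense in RGB and sparse in the projected channels.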