Abstract

In autonomous driving, lidar has become the primary vehicle sensor thanks to advantages such as long measurement range and high accuracy. However, the collected point cloud data is sparse and unevenly distributed, and its representational power degrades for objects with missing or similar shapes; as a result, detection accuracy is low for distant small targets that have similar shapes and few points. To improve detection accuracy on small targets represented by point clouds, this paper fuses point clouds with RGB images to construct a 3D object detection network architecture based on two-stage complementary fusion. In the first fusion stage, we use farthest point sampling (FPS) to select a subset of the raw point cloud as key points; we then voxelize the raw point cloud and apply a 3D sparse convolutional neural network to extract multi-scale point cloud features, which are fused with the key points across scales. In the second fusion stage, a 2D object detector produces the 2D bounding box and category of each target in the image; taking the camera as the origin, the box is extended along its viewing direction to form a frustum, and the point cloud and target category information within the frustum are fused into the key points. Using key points as a bridge, this method effectively combines image semantics such as texture and color with point cloud features. The experimental results show that the proposed network performs excellently on small objects.
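To make the first-stage key-point selection concrete, here is a minimal NumPy sketch of farthest point sampling (FPS). It illustrates the standard algorithm named in the abstract, not the paper's implementation; the function name and the sample count of 2048 are assumptions.

```python
import numpy as np

def farthest_point_sampling(points, n_samples):
    """Select n_samples key points from an (N, 3) point cloud via FPS.

    Iteratively picks the point farthest from the already-chosen set,
    giving uniform spatial coverage of a sparse lidar scan.
    """
    n = points.shape[0]
    selected = np.zeros(n_samples, dtype=np.int64)
    # Squared distance from every point to its nearest selected key point.
    dist = np.full(n, np.inf)
    selected[0] = np.random.randint(n)  # arbitrary seed point
    for i in range(1, n_samples):
        # Refresh nearest-selected distances using the last chosen point.
        diff = points - points[selected[i - 1]]
        dist = np.minimum(dist, np.einsum("ij,ij->i", diff, diff))
        selected[i] = np.argmax(dist)   # farthest remaining point
    return points[selected], selected

# Example: pick 2048 key points from a raw scan of ~100k points.
cloud = np.random.rand(100_000, 3).astype(np.float32)
keypoints, idx = farthest_point_sampling(cloud, 2048)
```

Because each new key point is the one farthest from all previously chosen ones, the samples cover the sparse scan far more evenly than uniform random sampling would.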

Highlights

  • Deep learning, as a major artificial intelligence technology, has been widely used in computer vision, speech recognition, and natural language processing; most notably, its application to 2D object detection [1,2,3,4,5,6] and semantic segmentation [7,8,9,10] in computer vision has driven the rapid development of both fields over the past 10 years

  • Thereby, this paper combines the advantages of Voxel-based, Point-based, and Image-based methods to propose a 3D object detection network architecture based on two-stage complementary fusion. This architecture builds on a 3D sparse convolutional neural network combined with a mature 2D object detector, using key points as a bridge, so that a series of precise, guided fusion methods can be designed to accurately predict the categories, 3D positions, and other attributes of the objects in the surrounding environment

  • In order to fully combine the target object’s geometric information collected by the lidar with the semantic information collected by the camera, this paper proposes a 3D object detection network structure based on two-stage complementary fusion. This structure uses key points as a bridge to combine Point-based, Voxel-based, and Image-based methods, fully integrating and utilizing the geometric information of the point cloud and the semantic information of the image (see the frustum sketch below)
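The frustum step described in the abstract and highlights can be sketched as follows: project camera-frame lidar points through the pinhole intrinsics, keep those whose projection falls inside the 2D detection box, and attach the detected category as a one-hot vector. This is an illustration under simple assumptions (points already transformed into the camera frame, a single detection); all names are hypothetical, not the paper's code.

```python
import numpy as np

def frustum_points(points_cam, K, box2d, class_id, n_classes):
    """Select lidar points whose image projection lies inside a 2D box.

    points_cam: (N, 3) points already transformed into the camera frame.
    K:          (3, 3) camera intrinsic matrix (pinhole model).
    box2d:      (x1, y1, x2, y2) bounding box from the 2D detector, in pixels.
    Returns frustum points with a one-hot category vector appended.
    """
    uvw = points_cam @ K.T
    depth = np.maximum(uvw[:, 2], 1e-6)           # guard against division by zero
    u, v = uvw[:, 0] / depth, uvw[:, 1] / depth   # pixel coordinates
    x1, y1, x2, y2 = box2d
    mask = (points_cam[:, 2] > 0) & (u >= x1) & (u <= x2) & (v >= y1) & (v <= y2)
    pts = points_cam[mask]
    onehot = np.zeros((pts.shape[0], n_classes), dtype=np.float32)
    onehot[:, class_id] = 1.0
    return np.concatenate([pts, onehot], axis=1)
```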


Summary

Introduction

Deep learning, as a major artificial intelligence technology, has been widely used in computer vision, speech recognition, and natural language processing; most notably, its application to 2D object detection [1,2,3,4,5,6] and semantic segmentation [7,8,9,10] in computer vision has driven the rapid development of both fields over the past 10 years. The contributions of this paper are as follows. (1) The proposed architecture uses key points as a bridge to successfully combine Point-based, Voxel-based, and Image-based methods; in this way, the complementary fusion of the geometric information collected by the lidar and the semantic information collected by the camera is realized. (2) This paper proposes a Feature Fusion Model that fuses voxel features, raw point cloud features, BEV features, and RGB image features, and experimentally analyzes the contribution of each feature type to the accuracy of the network model; the results show that adding RGB image features significantly improves the accuracy of 3D point cloud object detection on small targets and substantially improves it on objects with similar structures. (3) To avoid over-reliance on the 2D detector's performance when fusing point clouds and images, this paper proposes assigning foreground and background point features within the frustum according to the detector's confidence score and fusing them with the point cloud features, processing the first and second stages in parallel so that the image features serve only as auxiliary features (a sketch of this weighting follows below).
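Contribution (3) hinges on confidence-based weighting. The NumPy sketch below shows one plausible reading, not the paper's method: points inside the frustum (foreground) receive the 2D detector's class vector scaled by its confidence score, points outside keep a zero semantic slot, and the result is concatenated to the lidar features, so a weak 2D detection only weakly influences the fused representation. The function and argument names are hypothetical.

```python
import numpy as np

def fuse_confidence_weighted_semantics(point_feats, in_frustum, score, class_vec):
    """Append confidence-weighted 2D-detector semantics to per-point features.

    point_feats: (N, C) lidar point features from the first stage.
    in_frustum:  (N,) boolean mask marking foreground points inside the frustum.
    score:       scalar confidence of the 2D detection, in [0, 1].
    class_vec:   (K,) one-hot category vector from the 2D detector.
    """
    n = point_feats.shape[0]
    semantic = np.zeros((n, class_vec.shape[0]), dtype=np.float32)
    # Foreground points get the class vector scaled by detector confidence;
    # background points keep zeros, so image cues stay strictly auxiliary.
    semantic[in_frustum] = score * class_vec
    return np.concatenate([point_feats, semantic], axis=1)
```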

Related Work
System Design
Experiment
Method
Conclusion