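Obtaining accurate 3D positions and dimensions of surrounding objects from high-resolution images is the basis for control and behavioral decision-making in autonomous driving, so image-based 3D object detection is a research hotspot in the field. Existing surveys review the methodology and results of this area in considerable detail, but they do not analyze deeply or systematically the factors that limit the detection accuracy of existing methods. Because autonomous driving places high demands on engineering application, and because existing methods are predominantly data-driven, this paper gives a fairly systematic account of academic and industrial research results and industry applications in 3D object detection from the perspectives of commonly used datasets and evaluation benchmarks, the influence of data, methodological constraints, and errors. First, an overview is given from the perspectives of academic research and applications in the autonomous driving industry. Then, four general-purpose datasets, including KITTI, are analyzed and summarized in detail in terms of data acquisition equipment, data accuracy, and annotation information, and the main evaluation metrics proposed by these datasets are compared. Next, the main factors that constrain algorithm performance, and the errors they cause, are analyzed from the data and methodology sides. On the data side, the constraints are mainly data accuracy, sample differences, the amount of annotated data, and annotation specifications; on the methodology side, they mainly include prior geometric relationships, depth prediction error, and data modality. Finally, the state of research in China and abroad is summarized, and future research directions are proposed in terms of datasets, evaluation metrics, and object depth prediction.

Accurate perception and measurement of the three-dimensional (3D) position and scale of surrounding objects is the basis for control and decision-making in autonomous driving. Autonomous vehicles are equipped with high-resolution cameras, light detection and ranging (LiDAR), radar, global positioning system (GPS)/inertial measurement unit (IMU), and other sensors. 3D object detection algorithms based on LiDAR or multi-modal data are difficult to deploy and apply because of shortcomings of LiDAR sensors such as high price, limited sensing range, and sparse point cloud data. In contrast, high-resolution cameras are widely used and cheaper, and they capture high-resolution spatial information along with richer shape and appearance details. Image-based 3D object detection has therefore attracted growing attention. However, the factors that constrain the detection accuracy of existing methods have yet to be analyzed thoroughly and systematically. We summarize research results and industrial applications from the perspectives of 1) commonly used datasets and evaluation criteria, 2) data impact, and 3) methodological constraints and prediction errors. First, we give a brief introduction from the perspectives of academic research and applications in the autonomous driving industry.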
We briefly review recent progress at Baidu Apollo, Google Waymo, Tesla, and other autonomous driving companies, and trace the development of 3D object detection methods for autonomous driving. Second, we analyze and summarize four popular datasets, namely KITTI, nuScenes, the Waymo Open Dataset, and DAIR-V2X, in terms of 1) data acquisition sensors, data accuracy, and label information; 2) the key evaluation metrics proposed by these datasets; and 3) the pros, cons, and applicability of these metrics. Third, we analyze the main constraints on image-based 3D object detection algorithms, and the errors they cause, from two sides: data and methodology. The main data constraints stem from data accuracy, sample differences, data volume, and data annotation. Data accuracy is mainly limited by equipment performance. Sample differences arise mainly from differences in object distance and viewing angle, occlusion, and truncation. Data volume is limited by the variety of 3D data types and the difficulty of labeling; 3D object detection datasets are much smaller than 2D object detection datasets. Data annotation concerns 3D bounding box labeling, labeling details, and dataset quality, especially the image annotations used in image-based 3D object detection. For non-rigid objects such as pedestrians, the annotation error is larger, and there is room to improve the labeling method. Image-based 3D object detection frameworks can be classified into one-stage and two-stage methods, and their limitations consist of 1) prior geometric relationships, 2) depth prediction accuracy, and 3) data modality. Prior geometric relationships concern the 2D-3D geometric constraints between 3D objects and their projections onto 2D images, and the positional relationships between objects.
Image-based 3D object detection methods face problems in applying prior 2D-3D geometric constraints, particularly for occluded and truncated objects. Predicting depth from 2D images is an ill-posed problem: projecting the scene onto the image plane collapses one dimension, and the resulting loss of depth information leads to depth prediction error. On the one hand, depth prediction is often inaccurate because of the projection relationship. On the other hand, continuous depth prediction often performs poorly at abrupt depth changes in the image (such as object edges). When the predicted depth is discretized, the depth classes are relatively coarse, and the bins cannot be made arbitrarily fine. The limitation of the single-image data modality is reflected mainly in large depth prediction errors. Detection performance can be improved by 1) simulating stereo signals or LiDAR point clouds, 2) using stereo images as auxiliary input, or 3) using point cloud data with accurate 3D information as a supervision signal. In addition, video data can improve detection accuracy to a certain extent. Fourth, we summarize and compare the current state of research in academia and industry. Finally, we suggest future research directions in terms of datasets, evaluation metrics, and depth prediction.
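
To make the remark on discretized depth concrete, the following is a minimal sketch, not taken from any method surveyed here, of uniform depth binning. The function names, the depth range of 0 to 80 m, and the bin counts are all illustrative assumptions. It shows that a depth recovered from a bin center can be off by up to half a bin width, so coarse bins put a floor on the achievable depth accuracy.

```python
def uniform_depth_bin(depth, d_min, d_max, num_bins):
    """Map a continuous depth (metres) to a bin index in [0, num_bins - 1].

    Hypothetical helper: uniform binning is only one of several possible
    discretization schemes and is assumed here for illustration.
    """
    if not d_min < d_max or num_bins < 1:
        raise ValueError("need d_min < d_max and num_bins >= 1")
    width = (d_max - d_min) / num_bins
    clamped = min(max(depth, d_min), d_max)  # keep depth inside the range
    index = int((clamped - d_min) / width)
    return min(index, num_bins - 1)  # depth == d_max falls in the last bin


def bin_center(index, d_min, d_max, num_bins):
    """Return the depth (metres) at the center of a bin."""
    width = (d_max - d_min) / num_bins
    return d_min + (index + 0.5) * width


def quantization_error(depth, d_min, d_max, num_bins):
    """Absolute error when a depth is replaced by its bin center."""
    index = uniform_depth_bin(depth, d_min, d_max, num_bins)
    return abs(bin_center(index, d_min, d_max, num_bins) - depth)


if __name__ == "__main__":
    true_depth = 10.3  # metres, illustrative value
    for bins in (8, 80):
        err = quantization_error(true_depth, 0.0, 80.0, bins)
        print(f"{bins} bins: error = {err:.2f} m")
```

Under these assumptions, moving from 8 bins to 80 bins shrinks the worst-case error from 5 m to 0.5 m, but every added bin is another class the network must separate, which is one way to read the statement that the bins cannot be made arbitrarily fine.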