Multi-Layer Self-Assessment with Filtering for 3D Object Detection in Autonomous Vehicles
Reliable detection of road users is critical to the safety of automated driving systems. While object detectors based on deep neural networks are widely used for this purpose, they remain susceptible to errors that could compromise safety. A promising strategy to mitigate these risks involves run-time perception monitoring mechanisms, commonly referred to in the literature as self-assessment or introspection. Current research in this area predominantly addresses anomaly detection or monitoring camera-based 2D object detection, with insufficient focus on in-distribution errors and 3D object detection. Additionally, existing 2D studies often monitor activation patterns from the final layers of the network backbone, overlooking earlier activations that preserve higher spatial resolution. Yet, high-resolution early-layer activations can be valuable for detecting errors with sparse 3D point clouds. We also argue that not all objects in a scene should equally influence frame-level error detection, a factor often neglected in current methods. To address these gaps, we propose a novel self-assessment mechanism for 3D object detection that leverages activation patterns from multiple network layers. This mechanism employs spatial filtering to focus the model on an area of interest in the close vicinity of the ego vehicle. Additionally, it utilises an object filtering mechanism that specifically targets missed objects by excluding the points of objects already detected. We evaluate our method using widely recognised object detectors and public datasets. Additionally, we demonstrate its robustness under domain shifts with real-world LiDAR data collected on motorways in diverse weather conditions. Results show the proposed mechanism provides a 6% AUROC improvement over last-layer activation methods with spatial filtering on the NuScenes dataset. It also demonstrates a superior ability to transfer knowledge under domain shifts.
Code is available at https://github.com/yatbazhakan/multi-layer-introspection .
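As a rough illustration of the two ingredients described in the abstract, the sketch below (NumPy only) applies a rectangular spatial filter around the ego vehicle and pools activation maps from several layers into a single feature vector for the error-detection model. The ROI extents, layer shapes, and function names are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def spatial_filter(points, x_range=(-20.0, 40.0), y_range=(-10.0, 10.0)):
    """Keep only LiDAR points inside a rectangular region of interest
    around the ego vehicle (the extents here are illustrative only)."""
    mask = (
        (points[:, 0] >= x_range[0]) & (points[:, 0] <= x_range[1])
        & (points[:, 1] >= y_range[0]) & (points[:, 1] <= y_range[1])
    )
    return points[mask]

def pool_multi_layer(activations):
    """Global-average-pool each layer's (C, H, W) activation map and
    concatenate, so early high-resolution layers contribute features
    alongside the last backbone layer."""
    return np.concatenate([a.mean(axis=(1, 2)).ravel() for a in activations])
```

The pooled vector would then be fed to a binary error classifier, whose AUROC is the metric reported in the abstract.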
- Research Article
- 10.1109/access.2021.3114399
- Jan 1, 2021
- IEEE Access
Nowadays, computer vision with 3D (three-dimensional) object detection and 6D (six-degrees-of-freedom) pose estimation is widely discussed and studied in the field. In the 3D object detection process, classifications are centered on the object’s size, position, and direction, while in 6D pose estimation, networks emphasize 3D translation and rotation vectors. Successful application of these strategies can have a huge impact on various machine learning-based applications, including autonomous vehicles, the robotics industry, and the augmented reality sector. Although extensive work has been done on 3D object detection with pose estimation from RGB images, the challenges have not been fully resolved. Our analysis provides a comprehensive review of contemporary techniques for complete 3D object detection and the recovery of an object’s 6D pose. In this review paper, we discuss several sophisticated methods in 3D object detection and 6D pose estimation, including some popular datasets, evaluation metrics, and the challenges the proposed methods face. Most importantly, this study makes an effort to offer some possible future directions in 3D object detection and 6D pose estimation. We take the autonomous vehicle as the sample case for this detailed review. Finally, this review provides a complete overview of the latest deep learning-based research studies on 3D object detection and 6D pose estimation systems and draws a comparison between some popular frameworks. To be concise, we propose a detailed summary of the state-of-the-art techniques of modern deep learning-based object detection and pose estimation models.
- Research Article
- 10.1109/tip.2019.2955239
- Nov 28, 2019
- IEEE Transactions on Image Processing
With the rapid development of deep learning technology and other powerful tools, 3D object detection has made great progress and become one of the fastest growing fields in computer vision. Many automated applications such as robotic navigation, autonomous driving, and virtual or augmented reality systems require accurate 3D object localization and detection. Under this requirement, many methods have been proposed to improve the performance of 3D object localization and detection. Despite recent efforts, 3D object detection remains a very challenging task due to occlusion, viewpoint variations, scale changes, and the limited information in 3D scenes. In this paper, we present a comprehensive review of recent state-of-the-art approaches in 3D object detection technology. We start with some basic concepts, then describe some of the available datasets designed to facilitate performance evaluation of 3D object detection algorithms. Next, we review the state-of-the-art technologies in this area, highlighting their contributions, importance, and limitations as a guide for future research. Finally, we provide a quantitative comparison of the results of the state-of-the-art methods on popular public datasets.
- Research Article
- 10.3390/rs15030627
- Jan 20, 2023
- Remote Sensing
Three-dimensional (3D) object detection with an optical camera and light detection and ranging (LiDAR) is an essential task in the fields of mobile robotics and autonomous driving. Current 3D object detection methods are based on deep learning and are data-hungry. Recently, semi-supervised 3D object detection (SSOD-3D) has emerged as a technique to alleviate the shortage of labeled samples. However, learning 3D object detection from noisy pseudo labels remains a challenging problem for SSOD-3D. In this paper, to dynamically filter unreliable pseudo labels, we first introduce a self-paced SSOD-3D method, SPSL-3D. It exploits self-paced learning to automatically adjust the reliability weight of each pseudo label based on its 3D object detection loss. To evaluate the reliability of pseudo labels more accurately, we present prior-knowledge-based SPSL-3D (named PSPSL-3D), which enhances SPSL-3D with the semantic and structural information provided by a LiDAR-camera system. Extensive experimental results on the public KITTI dataset demonstrate the effectiveness of the proposed SPSL-3D and PSPSL-3D.
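The reliability weighting described here can be illustrated with the classic self-paced-learning weight, where a pseudo label's weight shrinks as its detection loss grows. The exact regulariser used by SPSL-3D may differ; treat this as a generic sketch with an illustrative age threshold `lam`.

```python
import numpy as np

def self_paced_weights(losses, lam, soft=True):
    """Reliability weight per pseudo label from its detection loss.
    Hard SPL: w = 1 if loss < lam else 0. Soft (linear) SPL: w decays
    linearly to 0 as the loss approaches lam. This is the textbook
    self-paced-learning form, not necessarily SPSL-3D's exact choice."""
    losses = np.asarray(losses, dtype=float)
    if soft:
        return np.clip(1.0 - losses / lam, 0.0, 1.0)
    return (losses < lam).astype(float)
```

During training, `lam` would typically be increased over epochs so that progressively harder (higher-loss) pseudo labels are admitted.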
- Research Article
- 10.3390/s23084005
- Apr 15, 2023
- Sensors (Basel, Switzerland)
This paper presents a benchmark analysis of NVIDIA Jetson platforms when operating deep learning-based 3D object detection frameworks. Three-dimensional (3D) object detection could be highly beneficial for the autonomous navigation of robotic platforms, such as autonomous vehicles, robots, and drones. Since the function provides one-shot inference that extracts 3D positions with depth information and the heading direction of neighboring objects, robots can generate a reliable path to navigate without collision. To enable the smooth functioning of 3D object detection, several approaches have been developed to build detectors using deep learning for fast and accurate inference. In this paper, we investigate 3D object detectors and analyze their performance on the NVIDIA Jetson series, which contains an onboard graphical processing unit (GPU) for deep learning computation. Since robotic platforms often require real-time control to avoid dynamic obstacles, onboard processing with a built-in computer is an emerging trend. The Jetson series satisfies such requirements with a compact board size and suitable computational performance for autonomous navigation. However, a proper benchmark that analyzes the Jetson for a computationally expensive task, such as point cloud processing, has not yet been extensively studied. To examine the Jetson series for such expensive tasks, we tested the performance of all commercially available boards (i.e., Nano, TX2, NX, and AGX) with state-of-the-art 3D object detectors. We also evaluated the effect of the TensorRT library in optimizing a deep learning model for faster inference and lower resource utilization on the Jetson platforms. We present benchmark results in terms of three metrics: detection accuracy, frames per second (FPS), and resource usage with power consumption. From the experiments, we observe that all Jetson boards, on average, consume over 80% of GPU resources. Moreover, TensorRT could remarkably increase inference speed (i.e., four times faster) and reduce central processing unit (CPU) and memory consumption by half. By analyzing such metrics in detail, we establish research foundations on edge-device-based 3D object detection for the efficient operation of various robotic applications.
- Research Article
- 10.3724/sp.j.1089.2021.18368
- Mar 1, 2021
- Journal of Computer-Aided Design & Computer Graphics
In the field of automatic driving, computer perception and understanding of the surrounding environment are essential. Compared with 2D object detection, 3D point cloud object detection can provide three-dimensional information about the object that 2D object detection cannot. To address the large disparity between the original input point cloud and the detection result in 3D object detection, a region proposal generation module based on structure awareness is proposed, in which the structural features of each point are defined and the supervision information provided by the 3D point cloud object detection dataset is fully utilized. The network can thus learn more discriminative features to improve the quality of proposals. Secondly, this feature is added to the proposal fine-tuning stage to enrich the context features and local features of the point cloud. Evaluated on the KITTI 3D object detection dataset, in the region proposal generation stage, under an IoU threshold of 0.7 and using 50 proposals, the recall rate increases by more than 13% compared to previous results. In the proposal fine-tuning stage, the detection results for objects at all three difficulty levels are clearly improved, indicating the effectiveness of the proposed method for 3D point cloud object detection.
- Conference Article
- 10.1117/12.2586932
- Jan 4, 2021
Advanced Driver Assistance Systems (ADAS) are a very important part of a modern vehicle. To achieve high-level objectives in ADAS functions such as LKA (lane keeping assistance), LDW (lane departure warning), and FCW (forward collision warning), the quality of the algorithms under the hood must be extremely high. In the last few years, these algorithms have commonly been based on DNNs (deep neural networks) applied to the tasks of semantic and instance segmentation, 2D/3D object detection, and visual object tracking. Recent state-of-the-art DNN models usually solve only a single task from those listed above, and running several neural networks is rather computationally expensive and may even be impossible due to a lack of GPU memory. One approach used to overcome this problem is a shared backbone (also called a feature extractor or encoder). The backbone consumes most of the computing resources, so a model with a shared backbone achieves better inference performance. Unfortunately, the training procedure for a shared-backbone model has several difficulties. The first is the lack of datasets with all the required and uniform annotation types. The second is a more sophisticated backpropagation procedure. In this paper, we consider several methods for multi-task neural network training and present the results of such training procedures on several public datasets with dissimilar annotation types. The shared backbone is applied to the following three tasks performed simultaneously on the road scene: semantic segmentation, 2D object detection, and 3D object detection. While the inference performance of the DNNs with a shared backbone increased significantly, the quality evaluation results remain quite close to the original separate state-of-the-art DNNs and even outperform them in some evaluation indices.
- Research Article
- 10.1111/mice.13143
- Dec 20, 2023
- Computer-Aided Civil and Infrastructure Engineering
Three‐dimensional (3D) object detection, that is, localizing and classifying all critical objects in a 3D space, is essential for downstream construction scene analysis tasks. However, accurate instance segmentation, few 2D object segmentation and 3D object detection data sets, high‐quality feature representations for depth estimation, and limited 3D cues from a single red‐green‐blue (RGB) image pose significant challenges to 3D object detection and severely hinder its practical applications. In response to these challenges, an improved cascade‐based network with a transformer backbone and a boundary‐patch‐refinement method is proposed to build hierarchical features and refine object boundaries, resulting in better results in 2D object detection and instance segmentation. Furthermore, a novel self‐supervised monocular depth learning method is proposed to extract better feature representations for depth estimation from construction site video data with unknown camera parameters. Additionally, a pseudo‐LiDAR point cloud method and a 3D object detection method with a density‐based clustering algorithm are proposed to detect 3D objects in a construction scene without help from 3D labels, which will serve as a good foundation for other downstream 3D tasks. Finally, the proposed model is evaluated for object instance segmentation and depth estimation on the moving objects in construction sites (MOCS) and construction scene data sets. It brings a 9.16% gain in terms of mean average precision (mAP) for object detection and a 4.92% gain in mask mAP for object instance segmentation. The average order accuracy and relative mean error for depth estimation are improved by 0.94% and 60.56%, respectively. This study aims to overcome the challenges and limitations of 3D object detection and facilitate practical applications in construction scene analysis.
- Conference Article
- 10.1109/iros.2013.6696479
- Nov 1, 2013
This paper reports on the use of planar patches as features in a real-time simultaneous localization and mapping (SLAM) system to model smooth surfaces as piecewise-planar. This approach works well for using observed point clouds to correct odometry error, even when the point cloud is sparse. Such sparse point clouds are easily derived by Doppler velocity log sensors for underwater navigation. Each planar patch contained in this point cloud can be constrained in a factor-graph-based approach to SLAM so that neighboring patches are sufficiently coplanar so as to constrain the robot trajectory, but not so much so that the curvature of the surface is lost in the representation. To validate our approach, we simulated a virtual 6-degree of freedom robot performing a spiral-like survey of a sphere, and provide real-world experimental results for an autonomous underwater vehicle used for automated ship hull inspection. We demonstrate that using the sparse 3D point cloud greatly improves the self-consistency of the map. Furthermore, the use of our piecewise-planar framework provides an additional constraint to multi-session underwater SLAM, improving performance over monocular camera measurements alone.
- Conference Article
- 10.1109/itsc48978.2021.9564553
- Sep 19, 2021
The performance of object detection methods based on LiDAR information is heavily impacted by the availability of training data, usually limited to certain laser devices. As a result, the use of synthetic data is becoming popular when training neural network models, as both sensor specifications and driving scenarios can be generated ad-hoc. However, bridging the gap between virtual and real environments is still an open challenge, as current simulators cannot completely mimic real LiDAR operation. To tackle this issue, domain adaptation strategies are usually applied, obtaining remarkable results on vehicle detection when applied to range view (RV) and bird's eye view (BEV) projections while failing for smaller road agents. In this paper, we present a BEV domain adaptation method based on CycleGAN that uses prior semantic classification in order to preserve the information of small objects of interest during the domain adaptation process. The quality of the generated BEVs has been evaluated using a state-of-the-art 3D object detection framework at KITTI 3D Object Detection Benchmark. The obtained results show the advantages of the proposed method over the existing alternatives.
- Conference Article
- 10.1109/cisp-bmei48845.2019.8965844
- Oct 1, 2019
3D object detection from raw and sparse point clouds has received far less attention to date than its 2D counterpart. In this paper, we propose a novel framework called FVNet for 3D front-view proposal generation and object detection from point clouds. It consists of two stages: generation of front-view proposals and estimation of 3D bounding box parameters. Instead of generating proposals from camera images or bird's-eye-view maps, we first project point clouds onto a cylindrical surface to generate front-view feature maps which retain rich information. We then introduce a proposal generation network to predict 3D region proposals from the generated maps and further extrude objects of interest from the whole point cloud. Finally, we present another network to extract point-wise features from the extruded object points and regress the final 3D bounding box parameters in canonical coordinates. Our framework achieves real-time performance at 12 ms per point cloud sample. Extensive experiments on the 3D detection benchmark KITTI show that the proposed architecture outperforms state-of-the-art techniques which take either camera images or point clouds as input, in terms of accuracy and inference time.
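The cylindrical front-view projection that FVNet builds on can be sketched as follows: each point's azimuth indexes a column and its elevation indexes a row, with range stored as the cell value. The grid resolution and vertical field of view below are common KITTI-style assumptions, not values taken from the paper.

```python
import numpy as np

def cylindrical_projection(points, h=64, w=512,
                           fov_up=np.radians(3.0), fov_down=np.radians(-25.0)):
    """Project 3D points (N, 3) onto a cylindrical front-view grid.
    Rows index elevation, columns index azimuth; cell value is range.
    Resolution and FOV here are illustrative (KITTI-like), not FVNet's."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)
    azimuth = np.arctan2(y, x)                       # left-right angle
    elevation = np.arcsin(z / np.maximum(r, 1e-9))   # up-down angle
    u = ((azimuth + np.pi) / (2 * np.pi) * w).astype(int) % w
    v = (fov_up - elevation) / (fov_up - fov_down) * h
    v = np.clip(v, 0, h - 1).astype(int)
    fv_map = np.zeros((h, w), dtype=float)
    fv_map[v, u] = r                                 # keep range as the feature
    return fv_map
```

In practice additional channels (intensity, height, occupancy) would be stacked alongside range before feeding the map to the proposal network.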
- Conference Article
- 10.1109/cvpr52688.2022.00966
- Jun 1, 2022
Segmenting or detecting objects in sparse Lidar point clouds are two important tasks in autonomous driving that allow a vehicle to act safely in its 3D environment. The best performing methods in 3D semantic segmentation or object detection rely on a large amount of annotated data. Yet annotating 3D Lidar data for these tasks is tedious and costly. In this context, we propose a self-supervised pretraining method for 3D perception models that is tailored to autonomous driving data. Specifically, we leverage the availability of synchronized and calibrated image and Lidar sensors in autonomous driving setups for distilling self-supervised pre-trained image representations into 3D models. Hence, our method does not require any point cloud or image annotations. The key ingredient of our method is the use of superpixels to pool 3D point features and 2D pixel features in visually similar regions. We then train a 3D network on the self-supervised task of matching these pooled point features with the corresponding pooled image pixel features. The advantages of contrasting regions obtained by superpixels are that: (1) grouping together pixels and points of visually coherent regions leads to a more meaningful contrastive task that produces features well adapted to 3D semantic segmentation and 3D object detection; (2) all the different regions have the same weight in the contrastive loss regardless of the number of 3D points sampled in these regions; (3) it mitigates the noise produced by incorrect matching of points and pixels due to occlusions between the different sensors. Extensive experiments on autonomous driving datasets demonstrate the ability of our image-to-Lidar distillation strategy to produce 3D representations that transfer well to semantic segmentation and object detection tasks.
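A minimal sketch of the superpixel-driven pooling and matching idea, assuming average pooling and a standard InfoNCE-style objective; the paper's exact loss, temperature, and feature extractors are not reproduced here.

```python
import numpy as np

def pool_by_superpixel(features, assignments, n_sp):
    """Average features (N, D) grouped by superpixel id (N,).
    Every region contributes one pooled vector, regardless of how
    many points or pixels fall inside it."""
    pooled = np.zeros((n_sp, features.shape[1]))
    for s in range(n_sp):
        mask = assignments == s
        if mask.any():
            pooled[s] = features[mask].mean(axis=0)
    return pooled

def contrastive_loss(point_pooled, pixel_pooled, tau=0.07):
    """InfoNCE between matched superpixel regions: each pooled point
    feature should be closest to its own pooled pixel feature.
    A simplified stand-in for the paper's objective."""
    p = point_pooled / np.linalg.norm(point_pooled, axis=1, keepdims=True)
    q = pixel_pooled / np.linalg.norm(pixel_pooled, axis=1, keepdims=True)
    logits = p @ q.T / tau
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Pooling per region rather than per point is what gives every superpixel equal weight in the loss, as point (2) of the abstract notes.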
- Conference Article
- 10.1109/iv47402.2020.9304629
- Oct 19, 2020
In this paper, we propose a point cloud based 3D object detection framework, named MuRF-Net, that accounts for both contextual and local information by leveraging multi-receptive-field pillars. Common pipelines can be divided into a voxel-based feature encoder and an object detector. During the feature encoding steps, contextual information is neglected, which is critical for the 3D object detection task; thus, the encoded features are not well suited as input to the subsequent object detector. To address this challenge, we propose MuRF-Net with a multi-receptive-field voxelization mechanism to capture both contextual and local information. After voxelization, the voxelized points (pillars) are processed by a feature encoder, and a channel-wise feature reconfiguration module is proposed to combine the features with different receptive fields using a lateral enhanced fusion network. In addition, to handle the increase in memory and computational cost brought by multi-receptive-field voxelization, a dynamic voxel encoder is applied, taking advantage of the sparseness of the point cloud. Experiments on the KITTI benchmark for both 3D object and Bird's Eye View (BEV) detection on the car class show that MuRF-Net achieves state-of-the-art results compared with other voxel-based methods. Moreover, MuRF-Net runs at a nearly real-time speed of 20 Hz.
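The dynamic-voxelization idea of exploiting point cloud sparsity (materialising only non-empty pillars, with no fixed per-pillar point cap or padding) can be sketched as follows; the pillar size and grid extents are common KITTI-style assumptions, not MuRF-Net's actual settings.

```python
import numpy as np

def dynamic_pillarize(points, pillar_size=0.32,
                      x_range=(0.0, 69.12), y_range=(-39.68, 39.68)):
    """Group points (N, 3) into BEV pillars, keeping only non-empty
    ones: the 'dynamic' part means no padding to a fixed pillar count
    or a fixed number of points per pillar."""
    ix = ((points[:, 0] - x_range[0]) / pillar_size).astype(int)
    iy = ((points[:, 1] - y_range[0]) / pillar_size).astype(int)
    keys = ix * 1_000_000 + iy      # one integer key per BEV cell
    pillars = {}
    for k, p in zip(keys, points):
        pillars.setdefault(int(k), []).append(p)
    return {k: np.stack(v) for k, v in pillars.items()}
```

Running this at several `pillar_size` values would yield the multiple receptive fields that the fusion network then combines.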
- Conference Article
- 10.1109/rivf55975.2022.10013923
- Dec 20, 2022
Object detection is one of the important applications of image processing research and plays a decisive role in the field, in which 3D object detection has been a particular challenge for self-driving cars. Many research papers on 3D object detection have been published, but each has its own applications and limitations. To give an overview of 3D object detection from which suitable methods for related applications can be proposed, this paper surveys and classifies approaches into image-based, point-cloud-based, and fusion-based methods. The authors discuss, evaluate, and classify the most recent research on 3D object recognition and detection used in the field of self-driving cars, with their strengths and limitations. Besides, challenges to current successful 3D object detection techniques and suggestions for future research are also analyzed.
- Research Article
- 10.3390/wevj15010020
- Jan 7, 2024
- World Electric Vehicle Journal
The pursuit of autonomous driving relies on developing perception systems capable of making accurate, robust, and rapid decisions to interpret the driving environment effectively. Object detection is crucial for understanding the environment at these systems’ core. While 2D object detection and classification have advanced significantly with the advent of deep learning (DL) in computer vision (CV) applications, they fall short in providing essential depth information, a key element in comprehending driving environments. Consequently, 3D object detection becomes a cornerstone for autonomous driving and robotics, offering precise estimations of object locations and enhancing environmental comprehension. The CV community’s growing interest in 3D object detection is fueled by the evolution of DL models, including Convolutional Neural Networks (CNNs) and Transformer networks. Despite these advancements, challenges such as varying object scales, limited 3D sensor data, and occlusions persist in 3D object detection. To address these challenges, researchers are exploring multimodal techniques that combine information from multiple sensors, such as cameras, radar, and LiDAR, to enhance the performance of perception systems. This survey provides an exhaustive review of multimodal fusion-based 3D object detection methods, focusing on CNN and Transformer-based models. It underscores the necessity of equipping fully autonomous vehicles with diverse sensors to ensure robust and reliable operation. The survey explores the advantages and drawbacks of cameras, LiDAR, and radar sensors. Additionally, it summarizes autonomy datasets and examines the latest advancements in multimodal fusion-based methods. The survey concludes by highlighting the ongoing challenges, open issues, and potential directions for future research.
- Research Article
- 10.1016/j.image.2022.116667
- Feb 17, 2022
- Signal Processing: Image Communication
CARL-D: A vision benchmark suite and large scale dataset for vehicle detection and scene segmentation