Uncertainty-Aware AB3DMOT by Variational 3D Object Detection
Autonomous driving needs to rely on high-quality 3D object detection to ensure safe navigation in the world. Uncertainty estimation is an effective tool to provide statistically accurate predictions, while the associated detection uncertainty can be used to implement a safer navigation protocol or to include the user in the loop. In this paper, we propose a Variational Neural Network-based TANet 3D object detector to generate 3D object detections with uncertainty, and we introduce these detections into an uncertainty-aware AB3DMOT tracker. This is done by applying a linear transformation to the estimated uncertainty matrix, which is subsequently used as the measurement noise of the adopted Kalman filter. We implement two ways to estimate output uncertainty: internally, by computing the variance of the CNN outputs and propagating the uncertainty through the post-processing, and externally, by associating the final predictions of different samples and computing the covariance of each predicted box. In experiments, we show that external uncertainty estimation leads to better results, outperforming both internal uncertainty estimation and classical tracking approaches. Furthermore, we propose a method to initialize the Variational 3D object detector with a pretrained TANet model, which leads to the best-performing models.
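To make the tracking recipe concrete, here is a minimal sketch of the external uncertainty estimate and the Kalman measurement update it feeds. The `(N, 7)` box layout, the association step producing `sampled_boxes`, and the transform coefficients `a` and `b` are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def external_box_uncertainty(sampled_boxes):
    """Empirical mean and covariance of the boxes that N stochastic forward
    passes of a variational detector produced for the same physical object
    (after associating predictions across samples).

    sampled_boxes: (N, 7) array of boxes [x, y, z, theta, l, w, h].
    """
    mean = sampled_boxes.mean(axis=0)
    cov = np.cov(sampled_boxes, rowvar=False)  # (7, 7) covariance per box
    return mean, cov

def kalman_update(x, P, z, R, H):
    """Standard Kalman measurement update; R is supplied by the detector."""
    y = z - H @ x                     # innovation
    S = H @ P @ H.T + R               # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)    # Kalman gain
    return x + K @ y, (np.eye(len(x)) - K @ H) @ P

# Linear rescaling of the estimated covariance before using it as the
# measurement noise, mirroring the paper's linear transformation
# (coefficients a and b are hypothetical):
# R = a * cov + b * np.eye(7)
```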
- Research Article
14
- 10.1111/mice.13143
- Dec 20, 2023
- Computer-Aided Civil and Infrastructure Engineering
Three-dimensional (3D) object detection, that is, localizing and classifying all critical objects in a 3D space, is essential for downstream construction scene analysis tasks. However, the difficulty of accurate instance segmentation, the scarcity of 2D object segmentation and 3D object detection data sets, the need for high-quality feature representations for depth estimation, and the limited 3D cues available from a single red-green-blue (RGB) image pose significant challenges to 3D object detection and severely hinder its practical applications. In response to these challenges, an improved cascade-based network with a transformer backbone and a boundary-patch-refinement method is proposed to build hierarchical features and refine object boundaries, yielding better results in 2D object detection and instance segmentation. Furthermore, a novel self-supervised monocular depth learning method is proposed to extract better feature representations for depth estimation from construction site video data with unknown camera parameters. Additionally, a pseudo-LiDAR point cloud method and a 3D object detection method with a density-based clustering algorithm are proposed to detect 3D objects in a construction scene without help from 3D labels, which serves as a good foundation for other downstream 3D tasks. Finally, the proposed model is evaluated for object instance segmentation and depth estimation on the moving objects in construction sites (MOCS) and construction scene data sets. It brings a 9.16% gain in mean average precision (mAP) for object detection and a 4.92% gain in mask mAP for object instance segmentation. The average order accuracy and relative mean error for depth estimation are improved by 0.94% and 60.56%, respectively. This study aims to overcome the challenges and limitations of 3D object detection and facilitate practical applications in construction scene analysis.
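The pseudo-LiDAR step rests on standard pinhole back-projection of a predicted depth map into a point cloud. A minimal sketch follows, assuming intrinsics `fx, fy, cx, cy` are available (the paper itself estimates them from video with unknown camera parameters).

```python
import numpy as np

def depth_to_pseudo_lidar(depth, fx, fy, cx, cy):
    """Back-project a dense depth map (H, W) into an (M, 3) pseudo-LiDAR
    point cloud via the pinhole model: x = (u - cx) * z / fx, etc."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # drop pixels without valid depth
```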
- Research Article
75
- 10.1109/tip.2019.2955239
- Nov 28, 2019
- IEEE Transactions on Image Processing
With the rapid development of deep learning technology and other powerful tools, 3D object detection has made great progress and become one of the fastest growing fields in computer vision. Many automated applications, such as robotic navigation, autonomous driving, and virtual or augmented reality systems, require the estimation of accurate 3D object locations and detections. To meet this requirement, many methods have been proposed to improve the performance of 3D object localization and detection. Despite recent efforts, 3D object detection is still a very challenging task due to occlusion, viewpoint variations, scale changes, and the limited information available in 3D scenes. In this paper, we present a comprehensive review of recent state-of-the-art approaches in 3D object detection technology. We start with some basic concepts, then describe some of the available datasets designed to facilitate the performance evaluation of 3D object detection algorithms. Next, we review the state-of-the-art technologies in this area, highlighting their contributions, importance, and limitations as a guide for future research. Finally, we provide a quantitative comparison of the results of the state-of-the-art methods on popular public datasets.
- Research Article
3
- 10.3724/sp.j.1089.2021.18368
- Mar 1, 2021
- Journal of Computer-Aided Design & Computer Graphics
In the field of autonomous driving, computer perception and understanding of the surrounding environment is essential. Compared with 2D object detection, 3D point cloud object detection can provide three-dimensional information about an object that 2D object detection cannot. To address the large gap between the original input point cloud and the detection result in 3D object detection, a region proposal generation module based on structure awareness is proposed, in which the structural features of each point are defined and the supervision information provided by the 3D point cloud object detection dataset is fully utilized, so that the network can learn more discriminative features and improve the quality of proposals. Second, these structural features are added to the proposal fine-tuning stage to enrich the context and local features of the point cloud. Evaluated on the KITTI 3D object detection dataset, in the region proposal generation stage, under an IoU threshold of 0.7 and using 50 proposals, the recall rate increases by more than 13% compared to previous results. In the proposal fine-tuning stage, the detection results for objects at all three difficulty levels are clearly improved, indicating the effectiveness of the proposed method for 3D point cloud object detection.
- Research Article
81
- 10.1109/access.2021.3114399
- Jan 1, 2021
- IEEE Access
Nowadays, computer vision with 3D (three-dimensional) object detection and 6D (six degrees of freedom) pose estimation is widely discussed and studied in the field. In the 3D object detection process, classification is centered on the object's size, position, and direction, while in 6D pose estimation, networks emphasize 3D translation and rotation vectors. Successful application of these strategies can have a huge impact on various machine learning-based applications, including autonomous vehicles, the robotics industry, and the augmented reality sector. Although extensive work has been done on 3D object detection with pose estimation from RGB images, the challenges have not been fully resolved. Our analysis provides a comprehensive review of contemporary techniques for complete 3D object detection and the recovery of an object's 6D pose. In this review paper, we discuss several sophisticated methods in 3D object detection and 6D pose estimation, including popular data sets, evaluation metrics, and the challenges these methods face. Most importantly, this study makes an effort to offer possible future directions in 3D object detection and 6D pose estimation. We take the autonomous vehicle as the sample case for this detailed review. Finally, this review provides a complete overview of the latest deep learning-based research studies related to 3D object detection and 6D pose estimation systems and draws a comparison between some popular frameworks. To be more concise, we provide a detailed summary of the state-of-the-art techniques of modern deep learning-based object detection and pose estimation models.
- Research Article
22
- 10.3390/s23084005
- Apr 15, 2023
- Sensors (Basel, Switzerland)
This paper presents a benchmark analysis of NVIDIA Jetson platforms when operating deep learning-based 3D object detection frameworks. Three-dimensional (3D) object detection can be highly beneficial for the autonomous navigation of robotic platforms, such as autonomous vehicles, robots, and drones. Since the function provides one-shot inference that extracts the 3D positions, depth information, and heading direction of neighboring objects, robots can generate a reliable path to navigate without collision. To enable the smooth functioning of 3D object detection, several approaches have been developed to build detectors using deep learning for fast and accurate inference. In this paper, we investigate 3D object detectors and analyze their performance on the NVIDIA Jetson series, which contains an onboard graphical processing unit (GPU) for deep learning computation. Since robotic platforms often require real-time control to avoid dynamic obstacles, onboard processing with a built-in computer is an emerging trend. The Jetson series satisfies such requirements with a compact board size and suitable computational performance for autonomous navigation. However, a proper benchmark analyzing the Jetson series on computationally expensive tasks, such as point cloud processing, has not yet been extensively studied. In order to examine the Jetson series for such expensive tasks, we tested the performance of all commercially available boards (i.e., Nano, TX2, NX, and AGX) with state-of-the-art 3D object detectors. We also evaluated the effect of the TensorRT library in optimizing a deep learning model for faster inference and lower resource utilization on the Jetson platforms. We present benchmark results in terms of three metrics: detection accuracy, frames per second (FPS), and resource usage with power consumption. From the experiments, we observe that all Jetson boards, on average, consume over 80% of GPU resources. Moreover, TensorRT could remarkably increase inference speed (i.e., four times faster) and reduce central processing unit (CPU) and memory consumption by half. By analyzing such metrics in detail, we establish research foundations on edge device-based 3D object detection for the efficient operation of various robotic applications.
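As a rough illustration of the FPS metric reported in the benchmark, this sketch times repeated inference of a hypothetical `model` on a CUDA device such as a Jetson GPU; the paper's full protocol also records GPU/CPU utilization and power, which this omits.

```python
import time
import torch

@torch.no_grad()
def benchmark_fps(model, sample_input, warmup=10, iters=100):
    """Average inference FPS on a CUDA device. Kernels launch
    asynchronously, so we synchronize around the timed region."""
    model.eval()
    for _ in range(warmup):            # warm-up stabilizes clocks and caches
        model(sample_input)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(sample_input)
    torch.cuda.synchronize()
    return iters / (time.perf_counter() - start)
```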
- Conference Article
3
- 10.1109/rivf55975.2022.10013923
- Dec 20, 2022
Object detection is one of the most important applications in image processing research, and 3D object detection in particular remains a challenge in the field of self-driving cars. Many research papers on 3D object detection have been published, but each has its own applications and limitations. To provide an overview of 3D object detection from which suitable methods for related applications can be chosen, this paper surveys and classifies approaches into image-based, point-cloud-based, and fusion-based methods. The authors discuss, evaluate, and classify the most recent research on 3D object recognition and detection used in the field of self-driving cars, along with their strengths and limitations. In addition, challenges to current successful 3D object detection techniques and suggestions for future research are analyzed.
- Research Article
16
- 10.34133/cbsystems.0079
- Jan 1, 2024
- Cyborg and Bionic Systems (Washington, D.C.)
The fusion of millimeter-wave radar and camera modalities is crucial for improving the accuracy and completeness of 3-dimensional (3D) object detection. Most existing methods extract features from each modality separately and conduct fusion with specifically designed modules, potentially resulting in information loss during modality transformation. To address this issue, we propose a novel framework for 3D object detection that iteratively updates radar and camera features through an interaction module. This module serves a dual purpose by facilitating the fusion of multi-modal data while preserving the original features. Specifically, radar and image features are sampled and aggregated with a set of sparse 3D object queries, while retaining the integrity of the original radar features to prevent information loss. Additionally, an innovative radar augmentation technique named Radar Gaussian Expansion is proposed. This module allocates radar measurements within each voxel to neighboring ones as a Gaussian distribution, reducing association errors during projection and enhancing detection accuracy. Our proposed framework offers a comprehensive solution to the fusion of radar and camera data, ultimately leading to heightened accuracy and completeness in 3D object detection processes. On the nuScenes test benchmark, our camera-radar fusion method achieves state-of-the-art 3D object detection results with a 41.6% mean average precision and 52.5% nuScenes detection score.
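A minimal sketch of the Radar Gaussian Expansion idea as described: each radar measurement contributes to its own voxel and its neighbors with weights from a Gaussian on the distance to each voxel center. The grid layout, `sigma`, and `radius` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def radar_gaussian_expansion(radar_xyz, voxel_size, grid_shape, sigma=0.5, radius=1):
    """Scatter radar points into a voxel grid with Gaussian weights.

    radar_xyz: (N, 3) point coordinates, already shifted so the grid
    origin is at zero; grid_shape: (X, Y, Z) number of voxels per axis.
    """
    grid = np.zeros(grid_shape, dtype=np.float32)
    for p in radar_xyz:
        base = np.floor(p / voxel_size).astype(int)
        for dx in range(-radius, radius + 1):
            for dy in range(-radius, radius + 1):
                for dz in range(-radius, radius + 1):
                    idx = base + np.array([dx, dy, dz])
                    if np.any(idx < 0) or np.any(idx >= grid_shape):
                        continue  # neighbor falls outside the grid
                    center = (idx + 0.5) * voxel_size
                    d2 = np.sum((p - center) ** 2)
                    grid[tuple(idx)] += np.exp(-d2 / (2 * sigma**2))
    return grid
```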
- Research Article
6
- 10.3390/rs15030627
- Jan 20, 2023
- Remote Sensing
Three-dimensional (3D) object detection with an optical camera and light detection and ranging (LiDAR) is an essential task in the fields of mobile robotics and autonomous driving. Current 3D object detection methods are based on deep learning and are data-hungry. Recently, semi-supervised 3D object detection (SSOD-3D) has emerged as a technique to alleviate the shortage of labeled samples. However, learning 3D object detection from noisy pseudo labels remains a challenging problem for SSOD-3D. In this paper, to dynamically filter unreliable pseudo labels, we first introduce a self-paced SSOD-3D method, SPSL-3D. It exploits self-paced learning to automatically adjust the reliability weight of each pseudo label based on its 3D object detection loss. To evaluate the reliability of pseudo labels more accurately, we present a prior-knowledge-based SPSL-3D (named PSPSL-3D) that enhances SPSL-3D with the semantic and structural information provided by a LiDAR-camera system. Extensive experimental results on the public KITTI dataset demonstrate the effectiveness of the proposed SPSL-3D and PSPSL-3D.
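The self-paced weighting can be illustrated with the classic hard self-paced regularizer, in which pseudo labels whose detection loss exceeds an age-controlled threshold are dropped from the objective; the paper may use a softer weighting, so treat this as a sketch.

```python
import torch

def self_paced_weights(losses, lam):
    """Hard self-paced weighting: weight 1 for pseudo labels with loss
    below the threshold lam, weight 0 otherwise. Increasing lam over
    training gradually admits harder pseudo labels."""
    return (losses < lam).float()

def weighted_detection_loss(losses, lam):
    # Weights are computed on detached losses so they are not differentiated.
    w = self_paced_weights(losses.detach(), lam)
    return (w * losses).sum() / w.sum().clamp(min=1.0)
```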
- Research Article
15
- 10.1109/jsen.2021.3101497
- Oct 1, 2021
- IEEE Sensors Journal
Three-dimensional (3D) object detection is of great significance for avoiding collisions between vehicles and obstacles in autonomous driving. In particular, recent 3D object detection methods based on supervised learning have been widely studied and achieve excellent performance. However, the 3D labels needed to train such methods are expensive and often difficult to collect. To address this issue, we propose a monocular 3D vehicle detection method. First, we propose a general mathematical K-means-like method for clustering arbitrary object contours into linear equations. Second, the position, orientation, and dimensions of the vehicle can be estimated by applying the K-means-like method to the contour of the vehicle, without the need for 3D labels. Finally, given the 2D object detection, we maximize the posterior probability of the vehicle's position, orientation, and dimensions to improve the accuracy of 3D object detection based on the results of the K-means-like method. We evaluate the proposed algorithm on a dataset collected by vehicle-side and road-side cameras in the cooperative vehicle infrastructure system (CVIS). Compared with the state-of-the-art Deep3DBox and SMOKE methods, the results show that the 3D object detection accuracy of our method is 1.4% higher than that of Deep3DBox in the vehicle-side system, while for the road-side camera, the proposed method is 3.86% and 4.37% more accurate than Deep3DBox and SMOKE, respectively. Thus, the proposed method can serve as an effective 3D object detection method in intelligent transportation systems and the CVIS.
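A K-means-like clustering of contour points into lines can be sketched as alternating between assigning each point to its nearest line and refitting each line by total least squares; the random initialization and degenerate-cluster handling here are illustrative choices.

```python
import numpy as np

def fit_line(points):
    """Total-least-squares line fit; returns (n, d) with n . p = d, |n| = 1."""
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid)
    n = vt[-1]                        # normal = direction of least variance
    return n, n @ centroid

def kmeans_like_lines(points, k, iters=20, seed=0):
    """Cluster (M, 2) contour points into k lines, K-means style."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, len(points))
    for _ in range(iters):
        lines = []
        for j in range(k):
            pts = points[labels == j]
            if len(pts) < 2:          # reseed a (near-)empty cluster
                pts = points[rng.integers(0, len(points), 2)]
            lines.append(fit_line(pts))
        dists = np.stack([np.abs(points @ n - d) for n, d in lines], axis=1)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, lines
```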
- Research Article
15
- 10.3390/s21041213
- Feb 9, 2021
- Sensors
Instance segmentation and object detection are significant problems in the fields of computer vision and robotics. We address these problems by proposing a novel object segmentation and detection system. First, we detect 2D objects based on RGB, depth-only, or RGB-D images. A 3D convolutional-based system, named Frustum VoxNet, is proposed. This system generates frustums from the 2D detection results, proposes 3D candidate voxelized images for each frustum, and uses a 3D convolutional neural network (CNN) based on these candidate voxelized images to perform 3D instance segmentation and object detection. Results on the SUN RGB-D dataset show that our RGB-D-based system's 3D inference is much faster than state-of-the-art methods, without a significant loss of accuracy. At the same time, we can provide segmentation and detection results using depth-only images, with accuracy comparable to RGB-D-based systems. This is important since our methods can also work well in low lighting conditions, or with sensors that do not acquire RGB images. Finally, the use of segmentation as part of our pipeline increases detection accuracy, while at the same time providing 3D instance segmentation.
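The frustum generation step, selecting the 3D points whose image projection falls inside a 2D detection box, can be sketched as follows, assuming points already expressed in the camera frame and known pinhole intrinsics.

```python
import numpy as np

def points_in_frustum(points, box2d, fx, fy, cx, cy):
    """Return the (M, 3) camera-frame points projecting inside box2d,
    i.e., the points inside the 2D detection's viewing frustum.
    box2d: (xmin, ymin, xmax, ymax) in pixels."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    valid = z > 0                              # in front of the camera
    zs = np.where(valid, z, 1.0)               # avoid division by zero
    u = fx * x / zs + cx
    v = fy * y / zs + cy
    xmin, ymin, xmax, ymax = box2d
    inside = (u >= xmin) & (u <= xmax) & (v >= ymin) & (v <= ymax)
    return points[valid & inside]
```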
- Research Article
60
- 10.1109/tits.2022.3219474
- Aug 1, 2023
- IEEE Transactions on Intelligent Transportation Systems
Fully automated vehicles collect information about their road environments to adjust their driving actions, such as braking and slowing down. The development of artificial intelligence (AI) and the Internet of Things (IoT) has improved the cognitive abilities of vehicles, allowing them to detect traffic signs, pedestrians, and obstacles for increasing the intelligence of these transportation systems. Three-dimensional (3D) object detection in front-view images taken by vehicle cameras is important for both object detection and depth estimation. In this paper, a joint channel attention and multidimensional regression loss method for 3D object detection in automated vehicles (called CAMRL) is proposed to improve the average precision of 3D object detection by focusing on the model's ability to infer the locations and sizes of objects. First, channel attention is introduced to effectively learn the yaw angles from the road images captured by vehicle cameras. Second, a multidimensional regression loss algorithm is designed to further optimize the size and position parameters during the training process. Third, the intrinsic parameters of the camera and the depth estimate of the model are combined to reduce the object depth computation error, allowing us to calculate the distance between an object and the camera after the object's size is confirmed. As a result, objects are detected, and their depth estimations are validated. Then, the vehicle can determine when and how to stop if an object is nearby. Finally, experiments conducted on the KITTI dataset demonstrate that our method is effective and performs better than other baseline methods, especially in terms of 3D object detection and bird's-eye view (BEV) evaluation.
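The distance computation from intrinsics and an estimated depth amounts to back-projecting the object's image location; a minimal sketch, with `(u, v)` standing in for the detected object center.

```python
import math

def object_distance(u, v, z, fx, fy, cx, cy):
    """Camera-to-object Euclidean distance from the object's pixel
    location (u, v) and its estimated depth z, via pinhole back-projection."""
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return math.sqrt(x * x + y * y + z * z)
```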
- Conference Article
3
- 10.1109/ccdc55256.2022.10034359
- Aug 15, 2022
Accurate object detection is a fundamental requirement for autonomous systems to operate in dynamic urban environments. Considering the intricacy of the environment and the occlusion of objects, various point cloud based three-dimensional (3D) object detection methods have been proposed, such as point-based or voxel-based methods. In this paper, a feature selection mechanism is proposed in a voxel-based method to generate bird's eye view (BEV) feature maps from the original point cloud. For the 3D backbone, a U-shaped structure and a single-scale feature selection module are combined. After combining high-level semantic and low-level fine-grained features, the optimized BEV features are applied to region of interest (RoI) refinement, so that the voxel features can better serve subsequent 3D object detection. Experimental results on the KITTI dataset show higher 3D object detection accuracy compared to state-of-the-art 3D detection methods, reflecting the effectiveness of the proposed architecture.
- Research Article
- 10.26689/jera.v9i2.9698
- Feb 24, 2025
- Journal of Electronic Research and Application
Three-dimensional (3D) object detection is crucial for applications such as robotic control and autonomous driving. While high-precision sensors like LiDAR are expensive, RGB-D sensors (e.g., Kinect) offer a cost-effective alternative, especially for indoor environments. However, RGB-D sensors still face limitations in accuracy and depth perception. This paper proposes an enhanced method that integrates attention-driven YOLOv9 with xLSTM into the F-ConvNet framework. By improving the precision of 2D bounding boxes generated for 3D object detection, this method addresses issues in indoor environments with complex structures and occlusions. The proposed approach enhances detection accuracy and robustness by combining RGB images and depth data, offering improved indoor 3D object detection performance.
- Research Article
6
- 10.1117/1.jei.31.5.053025
- Oct 6, 2022
- Journal of Electronic Imaging
Most three-dimensional (3D) object detection methods based on LiDAR point cloud data achieve relatively high performance in general cases. However, when the LiDAR points contain noise or other corruptions, detection performance can be severely affected. We propose a 3D object detection method that combines point cloud information with two-dimensional (2D) semantic segmentation information to enhance the feature representation for difficult cases, such as sparse, noisy, and partially absent data. Motivated by the PointPainting technique, we designed an early-stage fusion method based on a Voxel region-based convolutional neural network (R-CNN) architecture. The 2D semantic segmentation scores obtained by the PointPainting technique are appended to the raw point cloud data. The voxel-based features and 2D semantic information improve detection performance when the point cloud is corrupted. In addition, we also designed a multiscale hierarchical region of interest pooling strategy that reduces the computational cost of Voxel R-CNN by at least 43%. Our method shows competitive results with the state-of-the-art methods on the standard KITTI dataset. In addition, three corrupted KITTI datasets, KITTI sparse (KITTI-S), KITTI jittering (KITTI-J), and KITTI dropout (KITTI-D), were used for robustness testing. With noisy LiDAR points, our proposed point-painted Voxel R-CNN achieved superior detection performance over the baseline Voxel R-CNN for the moderate case, with a notable improvement of 11.13% in average precision (AP) on 3D object detection and 14.3% in AP on bird's eye view object detection.
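The painting step itself is a gather-and-concatenate operation; here is a sketch in the spirit of PointPainting, where `proj` stands in for the calibrated LiDAR-to-image projection.

```python
import numpy as np

def paint_points(points, seg_scores, proj):
    """Append per-pixel semantic scores to each LiDAR point.

    points: (N, 3) in the LiDAR frame; seg_scores: (H, W, C) softmax output
    of a 2D segmenter; proj: maps (N, 3) points to (N, 2) pixel coords."""
    uv = np.round(proj(points)).astype(int)
    h, w, c = seg_scores.shape
    inb = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    painted = np.zeros((len(points), c), dtype=np.float32)
    painted[inb] = seg_scores[uv[inb, 1], uv[inb, 0]]  # gather class scores
    return np.concatenate([points, painted], axis=1)   # (N, 3 + C)
```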
- Book Chapter
1
- 10.1007/978-3-030-84522-3_65
- Jan 1, 2021
With the rapid development of autonomous vehicles, three-dimensional (3D) object detection, whose purpose is to perceive the size and accurate location of objects in the real world, has become increasingly important. Many LiDAR-camera-based 3D object detectors have been developed with two heavy neural networks to extract view-specific features, and as a result such detectors run slowly, at about 10 frames per second (FPS). To tackle this issue, this paper first presents an accurate and efficient multi-sensor framework with an early-fusion method that exploits both LiDAR and camera data for fast 3D object detection. Moreover, we present a lightweight attention fusion module to further improve the performance of the proposed framework. Extensive experiments on the KITTI benchmark suite show that the proposed approach outperforms state-of-the-art LiDAR-camera-based methods on the three classes in 3D performance. Additionally, the proposed model runs at 23 FPS, which is almost 2× faster than state-of-the-art fusion methods for LiDAR and camera.
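A lightweight attention fusion module of the kind mentioned can be sketched as channel attention over the concatenated LiDAR and camera feature maps; this squeeze-and-excitation-style block is a generic stand-in, not the authors' exact design.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Fuse LiDAR and camera feature maps by concatenation followed by
    learned channel-wise reweighting (SE-style channel attention)."""
    def __init__(self, c_lidar, c_cam, reduction=8):
        super().__init__()
        c = c_lidar + c_cam
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                   # global channel stats
            nn.Conv2d(c, c // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c // reduction, c, 1), nn.Sigmoid(),
        )

    def forward(self, f_lidar, f_cam):
        fused = torch.cat([f_lidar, f_cam], dim=1)     # (B, c, H, W)
        return fused * self.gate(fused)                # reweight channels
```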