Deep Learning in Visual Odometry for Autonomous Driving
Abstract. Positioning, Navigation, and Timing (PNT) solutions are fundamental for autonomous driving, ensuring reliable localization for safe vehicle control in diverse environments. While GNSS-based systems provide absolute positioning, they become unreliable in GNSS-denied scenarios such as urban canyons or tunnels. Dead reckoning techniques, including Visual Odometry (VO), offer an alternative by estimating motion from onboard sensors. Integrating these methods with deep learning (DL) has shown potential for enhancing robustness, particularly in challenging conditions. This study, part of the VAIPOSA ESA project, investigates the performance of VO solutions under various environmental conditions using a simulation-based approach. The CARLA simulator provides controlled testing scenarios, enabling the evaluation of VO accuracy across different weather conditions, illumination changes, and dynamic environments. A synthetic stereo setup enables capturing error-free ground truth trajectories and fair evaluation of the VO methods. Multiple sequences are analyzed, reflecting real-world challenges such as poor visibility, texture variations, and occlusions. The findings highlight the influence of environmental factors and dynamic objects on VO performance and the role of DL in mitigating common failure modes.
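As a concrete illustration of how VO accuracy is typically scored against the simulator's error-free ground truth, a minimal absolute trajectory error (ATE RMSE) computation might look as follows; the function name and the mean-offset alignment are simplifying assumptions (full evaluations usually apply a proper SE(3) alignment such as Umeyama's method):

```python
import numpy as np

def ate_rmse(est, gt):
    """Absolute trajectory error (RMSE) between estimated and ground-truth
    positions, after removing the mean offset (a simplified alignment)."""
    est = np.asarray(est, dtype=float)
    gt = np.asarray(gt, dtype=float)
    # align the trajectories by matching their centroids
    aligned = est - est.mean(axis=0) + gt.mean(axis=0)
    return float(np.sqrt(((aligned - gt) ** 2).sum(axis=1).mean()))
```

A constant offset between the estimated and reference trajectories is removed by the centroid alignment, so only the shape error contributes to the score.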
- Research Article
- 10.7717/peerj-cs.1628
- Oct 10, 2023
- PeerJ Computer Science
Simultaneous localization and mapping (SLAM) is a fundamental problem in robotics and computer vision. It involves a robot or autonomous system navigating an unknown environment while simultaneously building a map of its surroundings and accurately estimating its position within that map. While significant progress has been made in SLAM over the years, challenges remain. One prominent issue is robustness and accuracy in dynamic environments, where moving objects introduce uncertainties and errors into the estimation process. Traditional methods that use temporal information to differentiate static and dynamic objects have limited accuracy and applicability. Recent research has therefore turned to deep learning-based methods, which leverage capabilities in dynamic-object handling, semantic segmentation, and motion estimation to improve accuracy and adaptability in complex scenes. This article proposes an approach to enhance the robustness and precision of monocular visual odometry in dynamic environments. The semantic segmentation network DeepLabV3+ identifies dynamic objects in the image, and a motion consistency check then removes feature points belonging to them. The remaining static feature points are used for feature matching and pose estimation based on ORB-SLAM2, evaluated on the Technical University of Munich (TUM) dataset. Experimental results show that, by eliminating the influence of moving objects, the method outperforms traditional visual odometry in accuracy and robustness, especially in dynamic environments. Compared with the original ORB-SLAM2, the system significantly reduces the absolute trajectory error and the relative pose error in dynamic scenes, markedly improving the accuracy and robustness of the SLAM system's pose estimation.
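The two-stage filtering the abstract describes (a segmentation mask plus a motion consistency check) can be sketched roughly as below; the function name, thresholds, and the use of a median-flow reference are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def filter_dynamic_points(points, semantic_mask, flow, median_flow, flow_thresh=2.0):
    """Keep feature points that are (a) outside the dynamic-object mask and
    (b) whose optical flow agrees with the dominant (camera-induced) motion."""
    keep = []
    for (x, y), f in zip(points, flow):
        if semantic_mask[int(y), int(x)]:
            continue  # inside a segmented dynamic object
        if np.linalg.norm(f - median_flow) > flow_thresh:
            continue  # motion-inconsistent: likely an unsegmented mover
        keep.append((x, y))
    return keep
```

The surviving points would then feed feature matching and pose estimation as usual.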
- Research Article
- 10.5302/j.icros.2022.21.0212
- Feb 28, 2022
- Journal of Institute of Control, Robotics and Systems
Visual navigation technology enables a robot's pose to be estimated and its surrounding environment to be perceived using an onboard vision sensor. This technology is essential to autonomous driving systems in unmanned vehicles and has been actively researched in visual odometry (VO) and visual simultaneous localization and mapping (vSLAM). Vision-based navigation algorithms generally perform data association and pose estimation under the assumptions that the brightness of the surroundings does not change over time and that the scene observed by the vision sensor is static. In realistic industrial sites or urban environments, however, illumination varies and dynamic objects such as workers and cars are present, which can degrade the reliability and performance of visual navigation. Research on visual navigation that is robust to such environmental variations, including illumination changes and dynamic circumstances, has sought to solve this problem. This study presents state-of-the-art visual navigation methods that are robust to illumination changes and dynamic environments, and analyzes and classifies them according to the methodology each one uses.
- Research Article
- 10.1016/j.compeleceng.2024.109127
- Feb 14, 2024
- Computers and Electrical Engineering
Towards explainable artificial intelligence in deep vision-based odometry
- Research Article
- 10.1088/1361-6501/abcc15
- Feb 15, 2021
- Measurement Science and Technology
To solve the accurate positioning problem for mobile robots, simultaneous localization and mapping (SLAM) and visual odometry (VO) based on visual information are widely used. However, most visual SLAM or VO systems cannot meet accuracy requirements in dynamic indoor environments. This paper proposes a robust visual odometry based on deep learning that eliminates feature-point matching errors. When a camera and dynamic objects are in relative motion, the camera frames exhibit ghosting, especially in high-dynamic environments, which introduces additional positioning error. To address this, a novel method based on the average optical flow value of the dynamic region is proposed to identify feature points belonging to the ghosting; these points, together with those in the dynamic region, are then removed. After the remaining feature points are matched, a non-linear optimization method computes the pose. The proposed algorithm is tested on the TUM RGB-D dataset, and the results show that the VO achieves higher positioning accuracy than other robust SLAM or VO systems and is strongly robust, especially in high-dynamic environments.
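A rough sketch of the average-optical-flow heuristic for flagging ghosting feature points, with hypothetical names and a hypothetical similarity criterion (the paper's exact rule may differ):

```python
import numpy as np

def ghost_feature_flags(flow_mag, dynamic_region, points, ratio=0.5):
    """Flag feature points whose optical-flow magnitude is close to the
    dynamic region's average - a heuristic for points lying on ghosting
    left behind by a moving object."""
    avg = flow_mag[dynamic_region].mean()  # average flow of the dynamic region
    flags = []
    for x, y in points:
        m = flow_mag[int(y), int(x)]
        flags.append(abs(m - avg) < ratio * avg)  # True = likely ghost point
    return flags
```

Flagged points would be removed alongside the dynamic-region points before matching.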
- Conference Article
- 10.1109/iros40897.2019.8968208
- Nov 1, 2019
In this paper, we propose a robust real-time visual odometry for dynamic environments via a rigid-motion model updated by scene flow. The proposed algorithm consists of spatial motion segmentation and temporal motion tracking. The spatial segmentation first generates several motion hypotheses using a grid-based scene flow and clusters them, separating objects that move independently of one another. We then use a dual-mode motion model to consistently distinguish between static and dynamic parts in the temporal motion tracking stage. Finally, the algorithm estimates the camera pose using only the regions classified as static. To evaluate visual odometry performance in the presence of dynamic rigid objects, we use a self-collected dataset containing RGB-D images and motion-capture data as ground truth. We compare our algorithm with state-of-the-art visual odometry algorithms, and the results show that it estimates the camera pose robustly and accurately in dynamic environments.
- Research Article
- 10.1088/1742-6596/2078/1/012016
- Nov 1, 2021
- Journal of Physics: Conference Series
Traditional visual-inertial odometry extracts key points according to manually designed rules. However, manually designed extraction rules are easily affected by illumination and viewpoint changes and exhibit poor robustness, degrading positioning accuracy. Deep learning methods show strong robustness in key-point extraction. To improve the positioning accuracy of visual-inertial odometry under illumination and viewpoint changes, deep learning is introduced into the visual-inertial odometry system for key-point detection. The encoder of the MagicPoint network is improved with depthwise separable convolutions, and the network is then trained with a self-supervised method. A deep-learning-based visual-inertial odometry system is built by replacing the traditional key-point detection algorithm in VINS with the trained network. The key-point detection network is tested on the HPatches dataset, and the odometry positioning performance is evaluated on the EuRoC dataset. The results show that the improved deep-learning-based visual-inertial odometry reduces positioning error by more than 5% without affecting real-time performance.
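Why a depthwise separable encoder shrinks the model can be seen from simple parameter counts; the sketch below compares a standard convolution layer with its depthwise separable replacement (the channel sizes in the test are arbitrary examples, not the MagicPoint configuration):

```python
def conv_params(c_in, c_out, k):
    """Weight count of a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k

def dws_conv_params(c_in, c_out, k):
    """Depthwise separable: one k x k filter per input channel (depthwise),
    then a 1 x 1 pointwise convolution that mixes channels."""
    return c_in * k * k + c_in * c_out
```

For a 3x3 layer with 32 input and 64 output channels, this cuts the weights from 18,432 to 2,336, roughly an eightfold reduction, which is consistent with the abstract's claim of a smaller model at unchanged real-time performance.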
- Research Article
- 10.54254/2753-8818/41/2024ch0187
- Nov 1, 2024
- Theoretical and Natural Science
Abstract. Visual simultaneous localization and mapping (VSLAM) technology provides a theoretical basis for the operation of unmanned equipment, such as autonomous vehicles and sweeping robots, in unfamiliar environments. Although traditional VSLAM systems have achieved great success after long-term development, they still struggle to maintain good performance in challenging environments. Deep learning, which has developed rapidly in the vision field in recent years, has shown outstanding advantages in image processing, and combining it with VSLAM is a hot topic. Deep learning can help traditional VSLAM systems compensate for the lack of scale information in dynamic environments by improving depth estimation, pose estimation, and loop closure detection; it can both reduce the scale of the network model and improve the accuracy of trajectory estimation. Specifically, regarding the fusion of the VSLAM pipeline with deep learning, many researchers have proposed fusion methods based on visual odometry, loop detection, and mapping. This work studies trends in combining VSLAM with deep learning algorithms, hoping to support the true autonomy of future mobile robots, and concludes with prospects for the development of VSLAM.
- Book Chapter
- 10.1007/978-981-19-2635-8_68
- Sep 30, 2022
Visual odometry (VO) has recently attracted significant attention, as evidenced by the increasing interest in the development of autonomous mobile robots and vehicles. Studies have traditionally focused on geometry-based VO algorithms. These algorithms exhibit robust results under a restrictive setup, such as static and well-textured scenes. However, they are not accurate in challenging environments, such as changing illumination and dynamic environments. In recent years, VO algorithms based on deep learning methods have been developed and studied to overcome these limitations. However, there remains a lack of literature that provides a thorough comparative analysis of state-of-the-art deep learning-based monocular VO algorithms in challenging environments. This paper presents a comparison of four state-of-the-art monocular VO algorithms based on deep learning (DeepVO, SfMLearner, SC-SfMLearner, and DF-VO) in environments with glass walls, illumination changes, and dynamic objects. These monocular VO algorithms are based on supervised, unsupervised, and self-supervised learning integrated with multiview geometry. Based on the results of the evaluation on a variety of datasets, we conclude that DF-VO is the most suitable algorithm for challenging real-world environments. Keywords: Monocular Visual Odometry, Deep Learning, Challenging Environment, Service Robot
- Research Article
- 10.1088/1361-6501/ad57dc
- Jun 27, 2024
- Measurement Science and Technology
The traditional visual inertial simultaneous localisation and mapping system does not fully consider the dynamic objects in the scene, which can reduce the quality of visual feature point matching. In addition, dynamic objects in the scene can cause illumination changes which reduce the performance of the visual front end and loop closure detection of the system. To address this problem, this study combines 3D light detection and ranging (LiDAR), camera, and inertial measurement units in a tightly coupled manner to estimate the pose of mobile robots, thereby proposing a robust LiDAR visual inertial odometry that can effectively filter out dynamic feature points. In addition, a dynamic feature point detection algorithm with attention mechanism is introduced for target detection and optical flow tracking. In experimental analyses on public datasets and real indoor scenes, the proposed method improved the accuracy and robustness of pose estimation in scenes with dynamic objects and varying illumination compared with traditional methods.
- Conference Article
- 10.23919/iccas47443.2019.8971455
- Oct 1, 2019
Estimating the camera pose in dynamic environments is one of the challenging problems in visual odometry. We propose an RGB-D dense visual odometry (Dense-VO) system that uses images preprocessed by a convolutional neural network (CNN). The CNN tracks a designated dynamic object, which is then excluded when Dense-VO estimates the camera motion by minimizing the photometric error between consecutive images. The system was tested on two datasets that include a dynamic object. The proposed approach, with its preprocessing procedure, estimates the camera trajectory with less drift in dynamic environments.
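Excluding the tracked dynamic object from the photometric error can be illustrated with a minimal stand-in that assumes a binary CNN mask; this is a sketch, not the paper's actual residual formulation:

```python
import numpy as np

def masked_photometric_error(img_ref, img_cur, dynamic_mask):
    """Mean absolute photometric residual over static pixels only; pixels
    marked as belonging to the tracked dynamic object are excluded."""
    static = ~dynamic_mask
    diff = img_ref.astype(np.float64) - img_cur.astype(np.float64)
    return float(np.abs(diff[static]).mean())
```

In a real system the residual would be computed between the reference frame and the current frame warped by the candidate camera motion; the masking idea is the same.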
- Conference Article
- 10.1109/iecon.2018.8591053
- Oct 1, 2018
A novel RGB-D visual odometry method for dynamic environments is proposed. The majority of visual odometry systems work only in static environments, which limits their real-world applications. To improve the accuracy and robustness of visual odometry in dynamic environments, a feature region segmentation algorithm is proposed to resist the disturbance caused by moving objects. The matched features are divided into different regions to separate the moving objects from the static background, and the features in the largest region, which belong to the static background, are used to estimate the camera pose. The effectiveness of the method is verified in a dynamic environment in our lab. Furthermore, an exhaustive experimental evaluation on benchmark datasets covering both static and dynamic environments compares it with state-of-the-art visual odometry systems. The accuracy comparison shows that the proposed algorithm outperforms those systems in large-scale dynamic environments: our method tracks the camera movement correctly where others fail, while giving equally good performance in static environments. Experiments demonstrate that the proposed RGB-D visual odometry obtains accurate and robust estimation results in dynamic environments.
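A toy version of the core idea, grouping matched features by motion and keeping the largest group as the static background, might look as follows; the greedy grouping and the threshold are illustrative assumptions, not the paper's segmentation algorithm:

```python
import numpy as np

def largest_static_region(points, displacements, thresh=1.5):
    """Greedily group matched features by displacement similarity; the
    largest group is assumed to be the static background."""
    groups = []
    for p, d in zip(points, displacements):
        for g in groups:
            if np.linalg.norm(np.asarray(d) - np.asarray(g["disp"])) < thresh:
                g["pts"].append(p)
                break
        else:
            groups.append({"disp": d, "pts": [p]})
    return max(groups, key=lambda g: len(g["pts"]))["pts"]
```

Only the returned points would be passed on to pose estimation.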
- Research Article
- 10.3390/app10041467
- Feb 21, 2020
- Applied Sciences
Traditional simultaneous localization and mapping (SLAM) (with loop closure detection) and visual odometry (VO) (without loop closure detection) are based on the static-environment assumption. In dynamic environments they perform poorly, whether using direct methods or indirect (feature-point) methods. In this paper, Dynamic-DSO, a semantic monocular direct visual odometry based on DSO (Direct Sparse Odometry), is proposed. The system is implemented entirely with the direct method, unlike most current dynamic systems, which combine the indirect method with deep learning. First, convolutional neural networks (CNNs) are applied to the original RGB image to generate pixel-wise semantic information about dynamic objects. Then, based on this semantic information, dynamic candidate points are filtered out during keyframe candidate-point extraction; only static candidate points are kept in the tracking and optimization module, achieving accurate camera pose estimation in dynamic environments. The photometric errors contributed by projection points in the dynamic regions of subsequent frames are removed from the total photometric error in the pyramid motion-tracking model. Finally, a sliding-window optimization that neglects the photometric error in the dynamic region of each keyframe yields the precise camera pose. Experiments on the public TUM dynamic dataset and a modified EuRoC dataset show that the positioning accuracy and robustness of Dynamic-DSO are significantly higher than those of the state-of-the-art direct method in dynamic environments, and the semi-dense point-cloud map constructed by Dynamic-DSO is clearer and more detailed.
- Conference Article
- 10.1117/12.2586310
- Jul 16, 2021
In this paper, the application of monocular visual odometry (VO) solutions to underground train stopping operations is explored. To analyze whether monocular VO is viable in challenging environments such as underground railway scenarios, different VO architectures are selected based on an analysis of the state of the art in deep-learning-based VO. Approaches from recent years fall into four categories: (1) supervised, purely deep-learning-based solutions; (2) solutions combining geometric features and deep learning; (3) solutions combining inertial sensors and deep learning; and (4) unsupervised deep-learning solutions. A dataset of underground train stop operations was also created, with ground truth labeled from the onboard-unit SIL-4 ERTMS/ETCS odometry data; it was recorded with a camera installed at the front of the train. Preliminary experimental results demonstrate that deep-learning-based VO solutions are applicable to underground train stop operations.
- Research Article
- 10.1109/tro.2020.3031267
- Nov 13, 2020
- IEEE Transactions on Robotics
In this paper we present a data-driven approach to obtaining the static image of a scene, eliminating dynamic objects that might have been present when the scene was traversed with a camera. The general objective is to improve vision-based localization and mapping in dynamic environments, where the presence (or absence) of different dynamic objects at different moments makes these tasks less robust. We introduce an end-to-end deep learning framework to turn images of an urban environment that include dynamic content, such as vehicles or pedestrians, into realistic static frames suitable for localization and mapping. This objective faces two main challenges: detecting the dynamic objects, and inpainting the occluded static background. The first challenge is addressed with a convolutional network that learns a multi-class semantic segmentation of the image. The second is approached with a generative adversarial model that, taking as input the original dynamic image and the computed dynamic/static binary mask, generates the final static image. The framework uses two new losses: one based on image steganalysis techniques, useful for improving inpainting quality, and another based on ORB features, designed to enhance feature matching between real and hallucinated image regions. To validate our approach, we perform an extensive evaluation with the hallucinated images on different tasks affected by dynamic entities, i.e., visual odometry, place recognition, and multi-view stereo. Code has been made available at https://github.com/bertabescos/EmptyCities_SLAM.
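The mask-then-inpaint pipeline can be illustrated with a trivial stand-in for the generative model: masked (dynamic) pixels are filled with the mean of the static pixels. This is purely illustrative; the paper's framework uses a GAN conditioned on the image and the binary mask:

```python
import numpy as np

def naive_static_frame(img, dynamic_mask):
    """Stand-in for learned inpainting: replace pixels under the dynamic
    mask with the mean intensity of the unmasked (static) pixels."""
    out = img.astype(np.float64).copy()
    out[dynamic_mask] = out[~dynamic_mask].mean()
    return out
```

The resulting frame has no trace of the dynamic content, which is the property that makes the hallucinated images useful for localization and mapping.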
- Conference Article
- 10.1117/12.2574361
- Oct 10, 2020
Point and line features have been widely used in visual SLAM (simultaneous localization and mapping) algorithms, but most such methods assume that the environment is static, ignoring the dynamic objects often present in the real world, which can degrade SLAM performance. To solve this problem, a line-expanded visual odometry is proposed. It calculates the optical flow between two adjacent frames to identify and eliminate dynamic point features on moving objects, then uses the remaining point features to find collinear relationships and expand them into line features for a point-based visual SLAM algorithm. Finally, it uses the remaining point features and line features to estimate the camera pose. The proposed method not only reduces the influence of dynamic objects but also avoids tracking failure when few point features are available. Experiments on a TUM dataset, compared with state-of-the-art methods such as the ORB (oriented FAST and rotated BRIEF) method and ORB with added optical flow, demonstrate that the proposed method reduces tracking error and improves the robustness and accuracy of visual odometry in dynamic environments.
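Finding collinear relationships among the remaining static points, as a building block for expanding line features, might be sketched as follows (brute-force, with an arbitrary tolerance; a hypothetical sketch, not the paper's method):

```python
from itertools import combinations

def collinear_triples(points, tol=1e-6):
    """Return triples of 2D points that are (nearly) collinear, as
    candidates for expansion into line features."""
    triples = []
    for a, b, c in combinations(points, 3):
        # twice the signed triangle area; ~0 means the points are collinear
        area2 = abs((b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0]))
        if area2 < tol:
            triples.append((a, b, c))
    return triples
```

A production system would use a robust line-fitting step rather than exhaustive triples, but the collinearity test itself is the same.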