A transformer-enhanced geometric learning framework for robust monocular visual odometry
Monocular visual odometry (VO) is crucial for ego-motion estimation in autonomous systems, but it suffers from scale ambiguity, computational inefficiency, and poor generalization across different motion scales. In this paper, we introduce a novel end-to-end monocular VO framework that combines convolutional feature extraction with Transformer-based spatial-temporal feature modeling. Our framework directly utilizes image patches and spatial coordinates instead of traditional descriptors, improving efficiency for monocular VO, where inter-frame motion is typically small. We integrate a multi-scale feature extraction model into the SuperPoint network using a Feature Pyramid Network (FPN) to address scale ambiguity. In addition, we design a hierarchical Transformer that enhances feature matching by incorporating spatial-temporal-aware attention, guided by geometric priors, to improve robustness in challenging scenes. A joint loss function combining pose loss, geometric consistency, and feature association, coupled with curriculum learning, ensures effective generalization. Evaluated on the KITTI dataset, our method demonstrates superior trajectory estimation accuracy compared to existing state-of-the-art learning-based models such as DeepVO and TSformer-VO, and achieves competitive or superior performance compared to traditional methods like ORB-SLAM3, especially in challenging sequences.
- Conference Article
1
- 10.1063/5.0043916
- Jan 1, 2021
Pose estimation is a key requirement for mobile robots and moving vehicles travelling through an environment in computer vision and robotics applications. Monocular visual odometry can provide better pose estimation than wheel odometry on slippery or smooth surfaces, and can be used to estimate the trajectory of a camera mounted on a vehicle. Fast and robust image feature detection and matching are essential for visual odometry, and noise can degrade both: a robust feature detector is one that maintains a good matching rate in the presence of noise, and good feature matching yields good visual odometry. To study the effect of salt-and-pepper noise on visual odometry, we mixed 5% salt-and-pepper noise into KITTI dataset images and compared the performance of SURF (Speeded-Up Robust Features), KAZE and MSER (Maximally Stable Extremal Regions). Visual odometry trajectories were plotted for each detector, compared with ground truth, and the root mean square error (RMSE) was calculated. Experiments on the KITTI dataset show that visual odometry obtained with KAZE is better than with SURF or MSER in the presence of noise, with lower RMSE, but it requires more computation time.
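The noise-injection step described above can be sketched in a few lines; the function name, seed, and uniform test image are illustrative, not taken from the paper:

```python
import numpy as np

def add_salt_pepper_noise(image, amount=0.05, rng=None):
    """Corrupt a grayscale uint8 image with salt-and-pepper noise.

    `amount` is the fraction of pixels replaced; half become
    salt (255) and half become pepper (0).
    """
    rng = np.random.default_rng(rng)
    noisy = image.copy()
    n = int(round(amount * image.size))
    # Choose n distinct pixel positions to corrupt.
    idx = rng.choice(image.size, size=n, replace=False)
    flat = noisy.reshape(-1)          # flat view onto `noisy`
    flat[idx[: n // 2]] = 255         # salt
    flat[idx[n // 2:]] = 0            # pepper
    return noisy

# 5% of a 100x100 image -> exactly 500 corrupted pixels.
img = np.full((100, 100), 128, dtype=np.uint8)
noisy = add_salt_pepper_noise(img, amount=0.05, rng=0)
```

The corrupted images would then be fed to each feature detector, and RMSE against the ground-truth trajectory compared across detectors.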
- Conference Article
3
- 10.1109/icassp.2019.8683446
- May 1, 2019
In this paper, we present iDVO (inertia-embedded deep visual odometry), a self-supervised learning-based monocular visual odometry (VO) method for road vehicles. When modelling geometric consistency within adjacent frames, most deep VO methods ignore the temporal continuity of the camera pose, which results in severe jagged fluctuations in the velocity curves. Observing that road vehicles exhibit smooth dynamics most of the time, we design an inertia loss function that penalizes abnormal motion variation, helping the model learn consecutiveness from long-term camera ego-motion. Based on a recurrent convolutional neural network (RCNN) architecture, our method implicitly models the dynamics of road vehicles and temporal consecutiveness with an extended Long Short-Term Memory (LSTM) block. Furthermore, we develop a dynamic hard-edge mask that handles non-consistency under fast camera motion by blocking the boundary region, making the overall non-consistency mask more effective. The proposed method is evaluated on the KITTI dataset, and the results demonstrate state-of-the-art performance with respect to other monocular deep VO and SLAM approaches.
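One plausible form of an inertia loss of this kind, penalizing frame-to-frame velocity changes (the paper's exact formulation may differ), is:

```python
import numpy as np

def inertia_loss(translations):
    """Penalize abrupt inter-frame velocity changes, encouraging the
    smooth dynamics of road vehicles. One plausible form of an inertia
    loss: mean squared difference of consecutive speeds.

    translations: sequence of (x, y) camera positions, one per frame.
    """
    t = np.asarray(translations, dtype=float)
    v = np.linalg.norm(np.diff(t, axis=0), axis=1)   # per-frame speed
    return float(np.mean(np.diff(v) ** 2))           # speed-change penalty
```

A constant-velocity trajectory incurs zero loss, while jagged velocity profiles are penalized, which is the behaviour the abstract describes.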
- Conference Article
3
- 10.1109/icarsc.2019.8733633
- Apr 1, 2019
Developing ground robots for crop monitoring and harvesting in steep-slope vineyards is a complex challenge for two main reasons: harsh terrain conditions and the unstable localization accuracy obtained with Global Navigation Satellite Systems (GNSS). In this context, a reliable localization system requires accurate information that is redundant with respect to GNSS and wheel-odometry-based systems. To pursue this goal we benchmark three well-known Visual Odometry methods on two datasets. Two of them are feature-based Visual Odometry algorithms, Libviso2 and SVO 2.0; the third is an appearance-based algorithm, DSO. Monocular Visual Odometry faces two main problems: pure rotations and scale estimation. In this paper we focus on the first. We propose a Kalman Filter that fuses a single gyroscope with the output pose of monocular Visual Odometry while continuously estimating the gyroscope's bias. In this approach we propose a non-linear noise variation that ensures the bias estimate is not affected by the rotations produced by Visual Odometry. We compare and discuss the three unchanged methods and the three methods augmented with the proposed Kalman Filter. Tests use two datasets: the public KITTI dataset and one built in-house. Results show that the additional Kalman Filter substantially improves Visual Odometry performance in rotational movements.
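A minimal sketch of such a gyroscope/VO fusion filter — a textbook two-state (yaw, bias) Kalman filter, not the paper's exact formulation with non-linear noise variation; all noise parameters here are illustrative:

```python
import numpy as np

class GyroVOKalman:
    """1-D Kalman filter fusing a gyroscope rate with the yaw reported
    by monocular VO, while tracking the gyroscope bias.
    State x = [yaw, bias]; the VO yaw is the measurement."""

    def __init__(self, q_yaw=1e-4, q_bias=1e-6, r_vo=1e-2):
        self.x = np.zeros(2)                  # [yaw, bias]
        self.P = np.eye(2)                    # state covariance
        self.Q = np.diag([q_yaw, q_bias])     # process noise
        self.R = r_vo                         # VO measurement noise

    def predict(self, gyro_rate, dt):
        # yaw += (measured rate - bias) * dt; bias is a random walk
        self.x[0] += (gyro_rate - self.x[1]) * dt
        F = np.array([[1.0, -dt], [0.0, 1.0]])
        self.P = F @ self.P @ F.T + self.Q

    def update(self, vo_yaw):
        H = np.array([1.0, 0.0])
        y = vo_yaw - H @ self.x               # innovation
        S = H @ self.P @ H + self.R
        K = self.P @ H / S                    # Kalman gain
        self.x += K * y
        self.P = (np.eye(2) - np.outer(K, H)) @ self.P

# Simulated run: true rate 0.1 rad/s, gyro bias 0.02 rad/s, dt 0.1 s.
kf = GyroVOKalman()
true_yaw, true_bias = 0.0, 0.02
for _ in range(500):
    true_yaw += 0.1 * 0.1
    kf.predict(0.1 + true_bias, 0.1)          # gyro reads rate + bias
    kf.update(true_yaw)                       # VO supplies the yaw
```

After the run the filter has recovered both the yaw and the gyroscope bias, which is what allows the fused system to stay accurate through pure rotations.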
- Research Article
16
- 10.1109/lra.2017.2651385
- Apr 1, 2017
- IEEE Robotics and Automation Letters
Drift and scale ambiguity are two main issues that reduce localization accuracy in monocular visual odometry (MVO), and a unified model is needed to represent these measurement uncertainties. In this paper, we present a geometric map-assisted localization approach for mobile robots equipped with MVO. We model each MVO measurement as a group of particles that obey a uniform-Gaussian distribution, covering both scale ambiguity and measurement randomness. The saliency of each particle, obtained from the distribution, indicates the raw measurement certainty of MVO. Geometric map-assisted shape matching is implemented as the measurement model to assign consistency to the particles generated from the distribution, and both saliency and consistency are considered in particle weight determination. Furthermore, based on the statistical properties of the probability distribution, a parameter estimation scheme is proposed to narrow down the scale ambiguity of MVO while resampling particles. Experiments with the KITTI dataset have demonstrated that the proposed approach greatly enhances positioning accuracy, with an average localization error of 6.54 m over a 15.89 km run.
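The uniform-Gaussian measurement model can be read as: draw a scale uniformly over its ambiguity interval, then add Gaussian noise around the scaled VO position. A sketch under that reading (the interval, sigma and particle count are illustrative):

```python
import numpy as np

def sample_uniform_gaussian(vo_xy, scale_lo, scale_hi, sigma, n, rng=None):
    """Draw particles for one MVO measurement: scale ~ Uniform over the
    ambiguity interval, then Gaussian position noise around the scaled
    measurement (an illustrative reading of the uniform-Gaussian model).
    Returns (particle positions, sampled scales)."""
    rng = np.random.default_rng(rng)
    scales = rng.uniform(scale_lo, scale_hi, size=n)
    pts = scales[:, None] * np.asarray(vo_xy, dtype=float)[None, :]
    return pts + rng.normal(0.0, sigma, size=(n, 2)), scales

# 1000 particles for a VO displacement of (10, 0) with 50% scale ambiguity.
pts, scales = sample_uniform_gaussian([10.0, 0.0], 0.5, 1.5, 0.2, 1000, rng=0)
```

Shape matching against the map would then weight these particles, and resampling would shrink the scale interval, as the abstract describes.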
- Conference Article
2
- 10.1109/globecom38437.2019.9013757
- Dec 1, 2019
Monocular visual odometry (VO) is one of the most practical approaches to autonomous vehicle positioning, allowing a vehicle to locate itself in a completely unknown environment. Although some existing VO algorithms have proved superior, they usually need careful re-tuning to operate well with a different camera or in different environments. Deep-learning-based VO methods require little manual calibration, but most occupy a tremendous amount of computing resources and cannot run in real time. We propose a highly real-time VO system based on optical flow and a DenseNet structure, aided by an inertial measurement unit (IMU). It cascades the optical flow network and the DenseNet structure to compute translation and rotation, using this information together with the IMU for map construction and self-correction. We have verified its computational complexity and performance on the KITTI dataset. Experiments show that the proposed system requires less than 50% of the computation of mainstream deep learning VO, while achieving up to 30% higher translation accuracy.
- Research Article
9
- 10.23919/jcc.2020.06.016
- Jun 1, 2020
- China Communications
Error or drift is frequently produced in pose estimation by geometric feature-detection-and-tracking monocular visual odometry (VO) when the camera moves faster than 1.5 m/s. Meanwhile, most deep-learning-based VO methods use fixed weight values, which easily leads to overfitting. We propose a new measurement system for monocular visual odometry, named Deep Learning Visual Odometry (DLVO), based on neural networks. In this system, a Convolutional Neural Network (CNN) extracts features and performs feature matching, and a Recurrent Neural Network (RNN) performs sequence modelling to estimate the camera's 6-DoF poses. Instead of fixed CNN weights, Bayesian distributions over the weight factors are introduced to effectively mitigate network overfitting. 18,726 frames from the KITTI dataset are used to train the network. This increases the generalization ability of the model at prediction time. Compared with the original Recurrent Convolutional Neural Network (RCNN), our method reduces the test loss by 5.33%, and it improves the robustness of translation and rotation estimates over traditional VO methods.
- Conference Article
51
- 10.1109/icra.2019.8793581
- May 1, 2019
Inspired by the cognitive process of humans and animals, Curriculum Learning (CL) trains a model by gradually increasing the difficulty of the training data. In this paper, we study whether CL can be applied to complex geometry problems like estimating monocular Visual Odometry (VO). Unlike existing CL approaches, we present a novel CL strategy for learning the geometry of monocular VO by gradually making the learning objective more difficult during training. To this end, we propose a novel geometry-aware objective function that jointly optimizes relative and composite transformations over small windows via a bounded pose regression loss. A cascaded optical flow network followed by a recurrent network with a differentiable windowed composition layer, termed CL-VO, is devised to learn the proposed objective. Evaluation on three real-world datasets shows the superior performance of CL-VO over state-of-the-art feature-based and learning-based VO methods.
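The windowed composition and bounded pose regression idea can be illustrated with 2-D poses (the paper works in full SE(3); the Geman-McClure-style bounding function here is an assumption, chosen only to show how a saturating loss keeps large pose errors from dominating):

```python
import numpy as np

def se2_matrix(x, y, theta):
    """Homogeneous 2-D pose (a stand-in for the paper's SE(3) poses)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, x], [s, c, y], [0.0, 0.0, 1.0]])

def compose_window(rel_poses):
    """Chain consecutive relative transforms over a window, as a
    windowed composition layer would."""
    T = np.eye(3)
    for R in rel_poses:
        T = T @ R
    return T

def bounded_pose_loss(T_pred, T_gt, c=1.0):
    """Saturating (bounded in [0, 1)) version of the squared pose error."""
    e = np.linalg.norm(T_pred - T_gt) ** 2
    return e / (e + c)

# Drive forward 1 m, turn 90 degrees, drive forward 1 m -> end at (1, 1).
T = compose_window([se2_matrix(1.0, 0.0, np.pi / 2), se2_matrix(1.0, 0.0, 0.0)])
```

During curriculum training the window over which `compose_window` runs would grow, making the composite-pose term of the objective progressively harder.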
- Research Article
97
- 10.1007/s00371-019-01714-6
- Jun 25, 2019
- The Visual Computer
Monocular visual odometry provides more robust navigation and obstacle-avoidance capability for mobile robots than other odometries, such as binocular visual odometry, RGB-D visual odometry and basic odometry. This paper describes the visual odometry problem and clarifies the relationship between visual odometry and visual simultaneous localization and mapping (SLAM). The basic principle of visual odometry is expressed mathematically: the pose change between successive frames is solved incrementally, and the odometry is further improved through global optimization. After analyzing the three main ways of implementing visual odometry, state-of-the-art monocular visual odometry systems, including ORB-SLAM2, DSO and SVO, are analyzed and compared in detail. The issues of robustness and real-time operation, which are of general interest in current visual odometry research, are discussed in terms of future development directions and trends. Furthermore, we present a novel framework for the implementation of next-generation visual odometry based on additional high-dimensional features that have not yet been exploited in relevant applications.
- Conference Article
1
- 10.1109/rcar52367.2021.9517422
- Jul 15, 2021
In recent years, visual-inertial simultaneous localization and mapping (SLAM) has been widely researched and used. Monocular depth estimation and visual odometry (VO) play a significant role in SLAM systems because of their low cost and high efficiency, and can be used to analyze indoor environments for intelligent mobile robot applications. Existing methods generally rely on pixel-level differences between consecutive frames to recover the environment structure and camera pose, but attending indifferently to the whole image wastes computing power and makes it hard to obtain ideal results. To overcome these disadvantages, we add an attention mechanism to the original network and modify some of its convolutions so that the network focuses on the key areas of the image, improving estimation accuracy. The proposed approach is evaluated on the KITTI dataset and achieves better results than state-of-the-art methods.
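A generic spatial-attention sketch in the spirit described — the abstract does not detail the actual attention design, so this is only one simple possibility (softmax weighting over spatial positions of a feature map):

```python
import numpy as np

def spatial_attention(features):
    """Reweight a (C, H, W) feature map by a softmax over spatial
    positions, so salient regions dominate. A generic sketch; the
    paper's exact attention mechanism is not specified here."""
    score = features.mean(axis=0)            # (H, W) channel-mean saliency
    w = np.exp(score - score.max())          # numerically stable softmax
    w /= w.sum()
    return features * w[None, :, :], w       # weighted features, weights

feats = np.random.default_rng(0).normal(size=(8, 4, 4))
out, w = spatial_attention(feats)
```

Real implementations learn the saliency scores rather than deriving them from a channel mean; the point is only that downstream computation concentrates on highly weighted positions.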
- Research Article
8
- 10.1109/jsen.2022.3227017
- Jan 15, 2023
- IEEE Sensors Journal
Self-supervised monocular depth and visual odometry (VO) are often cast as coupled tasks: accurate depth contributes to precise pose estimation and vice versa. Existing architectures typically stack convolution layers and long short-term memory (LSTM) units to capture long-range dependencies, but their intrinsic locality prevents the expected performance gain. In this article, we propose a Transformer-based architecture, named Transformer-based self-supervised monocular depth and VO (TSSM-VO), to tackle these problems. It comprises two main components: 1) a depth generator that leverages the powerful capability of multihead self-attention (MHSA) to model long-range spatial dependencies and 2) a pose estimator built upon a Transformer to learn long-range temporal correlations of image sequences. Moreover, a new data augmentation loss based on structural similarity (SSIM) is introduced to further constrain the structural similarity between the augmented depth and the augmented predicted depth. Rigorous ablation studies and exhaustive performance comparisons on the KITTI and Make3D datasets demonstrate the superiority of TSSM-VO over other self-supervised methods. We expect TSSM-VO to enhance the ability of intelligent agents to understand their surrounding environments.
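The SSIM-based augmentation loss can be sketched with a simplified global SSIM; the standard formulation averages SSIM over local windows, and the constants follow the usual SSIM defaults for data in [0, 1]:

```python
import numpy as np

def ssim_global(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified SSIM computed over the whole image (one global
    window) for arrays scaled to [0, 1]."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def ssim_consistency_loss(depth_aug, depth_pred_aug):
    """Augmentation loss in the spirit described: 1 - SSIM between the
    augmented depth and the prediction on the augmented input."""
    return 1.0 - ssim_global(depth_aug, depth_pred_aug)

a = np.linspace(0.0, 1.0, 64).reshape(8, 8)   # toy depth map
```

Identical maps give zero loss, and structural discrepancies (e.g. a rescaled prediction) give a positive penalty, which is what constrains the augmented branches to agree.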
- Conference Article
2
- 10.1109/ispds51347.2020.00009
- Aug 1, 2020
Monocular visual odometry (VO) based on deep learning has been one of the most important research topics in computer vision in recent decades. Unlike unsupervised and supervised methods, self-supervised learning can exploit additional image information to improve the accuracy and robustness of the model without ground-truth labels, so self-supervised visual odometry has received widespread attention. However, most self-supervised VO methods do not make full use of self-supervised signals, in particular optical flow. In this paper, we propose a new self-supervised network structure that makes full use of optical flow as a self-supervised signal by adding it to a basic cascaded VO estimation network. Besides providing additional information for training the camera pose estimator, the optical flow is used to generate a mask that handles the occlusions frequently encountered in self-supervised models. We evaluate the proposed self-supervised visual odometry model on the KITTI dataset; experiments show that it achieves good performance in both depth estimation and camera pose estimation.
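A standard way to derive an occlusion mask from optical flow is a forward-backward consistency check; this sketch uses nearest-neighbour warping for brevity (real implementations use bilinear sampling), and the threshold is illustrative:

```python
import numpy as np

def occlusion_mask(flow_fwd, flow_bwd, thresh=1.0):
    """Forward-backward consistency mask: a pixel is considered
    non-occluded where the backward flow, warped by the forward flow,
    approximately cancels it. Flows are (H, W, 2) arrays of (dx, dy)."""
    H, W, _ = flow_fwd.shape
    ys, xs = np.mgrid[0:H, 0:W]
    # Nearest-neighbour target coordinates under the forward flow.
    xt = np.clip(np.round(xs + flow_fwd[..., 0]).astype(int), 0, W - 1)
    yt = np.clip(np.round(ys + flow_fwd[..., 1]).astype(int), 0, H - 1)
    bwd_at_target = flow_bwd[yt, xt]          # backward flow, warped
    err = np.linalg.norm(flow_fwd + bwd_at_target, axis=-1)
    return err < thresh                        # True = non-occluded

# Consistent flows (uniform 1-px shift right, and its inverse) -> all valid.
H, W = 6, 8
fwd = np.zeros((H, W, 2)); fwd[..., 0] = 1.0
bwd = np.zeros((H, W, 2)); bwd[..., 0] = -1.0
mask = occlusion_mask(fwd, bwd)
```

Pixels failing the check are excluded from the photometric loss, which is how such a mask mitigates occlusion artifacts in self-supervised training.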
- Conference Article
1
- 10.1109/iscit.2010.5665126
- Oct 1, 2010
In this research, we propose a state-of-the-art 6-Degrees-Of-Freedom (6DOF) visual odometry (VO) system that fuses stereo VO with effective feature sampling, VO using the normal vector of the ground plane, and monocular VO. First, stereo VO performs 6DOF motion estimation in untextured dynamic environments. Second, ground-plane information improves the accuracy of roll and pitch angle estimation. Third, monocular VO is introduced to mitigate problems in stereo processing. Finally, an Unscented Kalman Filter fuses the information acquired from the three VO techniques to accomplish robust and accurate localization. We present and evaluate experimental results for our system over 730-meter runs in a challenging outdoor environment, compared against ground truth. The proposed VO system localizes the robot's position within 3.2% error in an untextured outdoor environment.
- Conference Article
7
- 10.1109/cpere56564.2023.10119592
- Feb 19, 2023
Recently, autonomous systems have developed rapidly. These systems are composed of various modules that enable automated navigation; the main generic modules are data acquisition, perceiving the environment, finding the optimal path and controlling the actuators. A core module of an automated vehicle estimates where the vehicle is with respect to the path to be traversed, namely vehicle localization. Localization has been based over the past few years on various sensors such as LiDARs, cameras and GPS, each with its own strengths and weaknesses. For instance, camera- and LiDAR-based odometry can function in GPS-denied environments, while GPS-based localization can work where cameras and LiDARs fail to extract useful information, such as in a desert. Automated vehicles have access to high-power on-board computers that can run complex and computationally expensive algorithms, but smaller platforms such as smart bikes are computationally limited, which is a particular challenge for visual-odometry-based algorithms. Given that little research has explored computationally efficient Monocular Visual Odometry (MVO) algorithms, this paper proposes and enhances an MVO module to estimate the path traversed by a smart bike platform in real life. The obtained scores are compared with the output of the same algorithm on a consumer-grade PC to validate the results and to explore the trade-off between the gained speed-up and the reduced accuracy.
- Research Article
19
- 10.1109/tits.2017.2685436
- Dec 1, 2017
- IEEE Transactions on Intelligent Transportation Systems
There are two main challenges, drift and scale ambiguity, restricting the extensive application of monocular visual odometry to real autonomous navigation. In this paper, an iterative localization framework is presented to globally localize a mobile vehicle equipped with a single camera and a freely available digital map. Inspired by the concept of the cloud model, a new Gaussian–Gaussian Cloud model is proposed to give a unified representation of the measurement randomness and scale ambiguity in monocular visual odometry. In this model, a collection of cloud drops is generated, and both drift and scale ambiguity are represented simultaneously in each drop. To reduce the measurement uncertainties of any drop in the Gaussian–Gaussian Cloud, road constraints from the open-source map OpenStreetMap are utilized: the map is first converted to a template edge map, and a shape-matching step then assigns each cloud drop a probability indicating to what degree the drop accords with the road constraints. A parameter estimation scheme narrows down the scale ambiguity of monocular visual odometry while resampling cloud drops. Evaluations on the KITTI benchmark dataset and our self-collected dataset have demonstrated the stability and accuracy of the proposed approach.
- Conference Article
- 10.1145/3175603.3175619
- Dec 29, 2017
Monocular visual odometry is an important technique for mobile robot localization and navigation. This paper first empirically studies two kinds of commonly used monocular visual odometry (MVO): descriptor-based methods and optical-flow-based methods. Six representative scenes are extracted from the KITTI and Karlsruhe datasets, and ten MVO algorithms are evaluated in terms of real-time performance and trajectory accuracy. Experimental results show that different MVO algorithms perform differently in different scenarios. Based on these results, an adaptive visual odometry (AVO) strategy is proposed: the changing environment is detected and the most suitable MVO algorithm is chosen dynamically according to a cost function. Experiments show that the AVO method obtains higher trajectory accuracy and better real-time performance.
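The cost-function selection step can be sketched as a weighted trade-off between runtime and trajectory error; the field names, weights, and candidate scores below are illustrative, not from the paper:

```python
def select_mvo(candidates, w_time=0.5, w_err=0.5):
    """Pick the MVO algorithm minimizing a weighted cost over
    normalized runtime and trajectory error for the current scene."""
    return min(candidates, key=lambda c: w_time * c["runtime"] + w_err * c["error"])

# Hypothetical per-scene scores, normalized to [0, 1].
best = select_mvo([
    {"name": "descriptor", "runtime": 0.8, "error": 0.2},
    {"name": "optical_flow", "runtime": 0.3, "error": 0.5},
])
```

In an adaptive scheme the scores (and possibly the weights) would be re-evaluated whenever an environment change is detected, triggering a switch of algorithm.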