NeuroFusion-SLAM: A Deep Neural Network Framework for Real-Time Multi-Sensor SLAM.
While deep learning-based visual SLAM (VSLAM) has achieved remarkable localization accuracy, its high computational cost and latency remain critical bottlenecks for real-time deployment. To address these limitations, this paper presents NeuroFusion-SLAM, a novel multi-sensor fusion framework tailored for both efficiency and robustness. By incorporating depthwise separable convolution, the framework cuts down model parameters by approximately 40% and training time by 49% while preserving localization accuracy, thus boosting real-time inference performance and computational efficiency in large-scale environments. Furthermore, a global edge optimization strategy is proposed by integrating sliding window optimization with a factor graph framework, which effectively improves the global consistency of the system. Extensive experiments on the TUM-VI and KITTI-360 datasets demonstrate that our system achieves real-time performance with an average latency of 30.4 ms per frame. It runs 3× faster than ORB-SLAM2 and 4× faster than VINS-Mono, while maintaining good localization accuracy.
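The parameter saving from depthwise separable convolution can be illustrated with a per-layer weight count. The sketch below is only an illustration: the layer sizes (3×3 kernel, 64 input channels, 128 output channels) and the function names are assumptions, not values from the paper, and bias terms are ignored. The whole-model reduction of approximately 40% reported in the abstract depends on which layers are replaced, which the abstract does not specify.

```python
def conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a standard k x k convolution (bias terms ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes, chosen only for illustration.
standard = conv_params(3, 64, 128)
separable = depthwise_separable_params(3, 64, 128)
ratio = separable / standard
print(standard, separable, round(ratio, 3))
```

For a single layer the ratio equals 1/c_out + 1/k², so the saving on one replaced layer is usually much larger than the saving over a whole network.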
- Dissertation
- 10.32657/10356/163428
- Jan 1, 2022
Autonomous vehicles such as UAVs and AGVs have received increasing attention over the past decades due to a wide range of applications in many areas. To accomplish robotic tasks intelligently, Simultaneous Localization and Mapping (SLAM) is considered a fundamental capability for mobile robots. The objective of SLAM is to simultaneously estimate the poses of robots and build a map of the unknown environment from the data of on-board sensors. Driven by the demand for low-cost and high-efficiency solutions, the development of autonomous localization and navigation capabilities for autonomous vehicles using on-board sensors has become a popular research topic in the robotics community.
Benefiting from various low-cost and lightweight off-the-shelf cameras, vision-based SLAM, or visual SLAM, has played an important role in many robotic applications. A number of impressive visual SLAM systems have been proposed in the literature using different types of cameras, such as monocular SLAM, RGB-D SLAM, and stereo SLAM. These methods have demonstrated impressive performance in specific scenarios and environments. However, there remain open problems for visual SLAM in challenging environments. First, since most visual SLAM methods rely on features extracted from the environment to estimate the visual odometry of cameras, they may perform poorly or even fail in low-textured environments with few features. Second, most of the existing visual SLAM methods rely heavily on a static world assumption and easily fail in dynamic environments involving moving objects. Third, the computational resources of small-scale autonomous vehicles are often limited, hence the real-time performance needs to be taken into consideration. To address these issues, we aim to improve the robustness of visual SLAM for autonomous vehicles in challenging environments, and introduce both geometric and semantic methods.
For autonomous navigation and safe control of UAVs, it is essential to have accurate and reliable velocity and position estimation. However, due to limited computational power and payload, autonomous operation of UAVs in complex environments remains challenging. In the first part of the thesis, we propose a robust and efficient velocity estimation framework for MAVs in cluttered environments using a single downward-facing RGB-D camera. Our method is able to provide metric velocity estimation in three dimensions as well as yaw rate for MAVs without the fusion of additional sensors. Moreover, based on a fast optical flow computation method which does not rely on time-consuming feature detection and matching, our approach is able to run in real time on MAVs.
Although a camera alone can provide satisfactory localization for autonomous vehicles, its navigation capability in real scenarios remains challenging due to unreliable depth measurement and inaccurate global map building. In many industrial scenarios, 2D LiDAR is another popular localization device due to its relatively low cost and high accuracy. However, it can only work on a 2D plane. To enhance the capabilities and intelligence of traditional AGVs equipped with 2D LiDAR, we propose an integration framework that combines the advantages of a camera and 2D LiDAR for robust navigation in warehouse environments. 2D LiDAR has the advantage of providing an accurate occupancy map, which is essential for path planning, but it is unable to detect obstacles in 3D space. To solve this, we propose an effective obstacle detection method in 3D space using an RGB-D camera, which can be directly used for obstacle avoidance in 2D LiDAR map-based navigation.
To track the position of vehicles using on-board cameras, most of the existing visual SLAM algorithms focus on points, either by feature matching or direct alignment of pixels, while ignoring other common but valuable geometric primitives such as lines and planes in the scene. In low-textured environments, it is often difficult to find a sufficient number of point features and, as a consequence, the performance of such algorithms degrades. To take full advantage of the available geometric information in the environment, we propose a multi-landmark SLAM framework in the third part of the thesis which combines point, line and plane features to benefit both tracking and mapping for autonomous vehicles in indoor environments. For the tracking part, we develop an optimization framework that integrates different geometric features extracted from an RGB-D camera. For the mapping part, we combine different features to build a structural map of the environment.
All the systems proposed above, and many other state-of-the-art visual SLAM solutions, implicitly assume a static environment, which means that there are no moving objects in the camera FoV and the estimated motion comes only from the moving camera. As a consequence, these systems may degrade or even fail when there are moving objects in the scene, such as persons or other vehicles. Therefore, in the next two parts of the thesis, we aim to improve the robustness of visual SLAM in dynamic environments. We first propose a geometric method using K-means clustering to detect dynamic parts in images, which does not require prior information about the moving objects. With the development of deep learning, we further explore the integration of semantic information for visual SLAM in dynamic environments. To reduce the computational cost, we only perform semantic segmentation on keyframes to remove known dynamic objects, and maintain a static map for robust camera tracking.
In addition, the geometry module in the previous part is integrated to handle unknown moving objects. Our system is able to run in real-time on a low-power embedded platform and provide high localization accuracy in dynamic environments.
- Research Article
2
- 10.1109/tase.2025.3599888
- Jan 1, 2025
- IEEE Transactions on Automation Science and Engineering
Visual Simultaneous Localization and Mapping (VSLAM) is a critical technology for autonomous driving and mobile robotics. Traditional VSLAM methods based on discrete representations, such as point clouds, offer high computational efficiency and excellent localization accuracy, but they exhibit limited robustness. In contrast, methods employing field representations, like Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3D GS), provide greater robustness at the expense of increased computational demands and reduced localization accuracy. Hybrid VSLAM approaches that attempt to combine these representations typically rely on serial, synchronous cascades, which compromise robustness, computational efficiency, and GPU memory usage. This paper introduces a novel adaptive cross-representation VSLAM framework that applies different representation modeling techniques to distinct regions of an image sequence and adopts asynchronous parallel modeling in overlapping regions. A Vision Language Model (VLM) is used to analyze the image sequence, enabling the detection of representation modeling regions and adaptive switching between representations. Cross-representation data association is performed through a coarse-to-fine feature selection process, resulting in a globally consistent map. The proposed method is evaluated on both public and custom-collected datasets, where experimental results show that it surpasses state-of-the-art methods in terms of robustness, computational efficiency, localization accuracy, and GPU memory usage.
- Research Article
11
- 10.3390/s24185929
- Sep 12, 2024
- Sensors (Basel, Switzerland)
Feature points from moving objects can negatively impact the accuracy of Visual Simultaneous Localization and Mapping (VSLAM) algorithms, while detection or semantic segmentation-based VSLAM approaches often fail to accurately determine the true motion state of objects. To address this challenge, this paper introduces DIO-SLAM: Dynamic Instance Optical Flow SLAM, a VSLAM system specifically designed for dynamic environments. Initially, the detection thread employs YOLACT (You Only Look At CoefficienTs) to distinguish between rigid and non-rigid objects within the scene. Subsequently, the optical flow thread estimates optical flow and introduces a novel approach to capture the optical flow of moving objects by leveraging optical flow residuals. Following this, an optical flow consistency method is implemented to assess the dynamic nature of rigid object mask regions, classifying them as either moving or stationary rigid objects. To mitigate errors caused by missed detections or motion blur, a motion frame propagation method is employed. Lastly, a dense mapping thread is incorporated to filter out non-rigid objects using semantic information, track the point clouds of rigid objects, reconstruct the static background, and store the resulting map in an octree format. Experimental results demonstrate that the proposed method surpasses current mainstream dynamic VSLAM techniques in both localization accuracy and real-time performance.
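The idea of classifying a rigid object mask as moving or stationary from optical flow residuals can be sketched as follows. This is a minimal illustration under stated assumptions, not the DIO-SLAM implementation: the function names, the use of a mean residual, and the 1-pixel threshold are all hypothetical, and the camera-induced flow is assumed to be already available for each mask pixel.

```python
import math


def mean_flow_residual(observed, predicted):
    """Mean magnitude of (observed - predicted) flow over the pixels of one mask.

    observed:  list of (u, v) flow vectors measured inside the mask
    predicted: list of (u, v) flow vectors explained by camera motion alone
    """
    residuals = [
        math.hypot(ou - pu, ov - pv)
        for (ou, ov), (pu, pv) in zip(observed, predicted)
    ]
    return sum(residuals) / len(residuals)


def is_moving(observed, predicted, threshold_px=1.0):
    """Label a mask as moving when its mean residual exceeds a (hypothetical) threshold."""
    return mean_flow_residual(observed, predicted) > threshold_px


# Toy data: camera motion alone would shift every pixel by (2, 0).
camera_flow = [(2.0, 0.0)] * 3
parked_car = [(2.0, 0.0)] * 3     # flow fully explained by camera motion
walking_cart = [(5.0, 4.0)] * 3   # residual of (3, 4) pixels per point

print(is_moving(parked_car, camera_flow), is_moving(walking_cart, camera_flow))
```

A region whose flow matches the camera-induced flow has a residual near zero and is treated as stationary, even if its semantic class is one that could move.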
- Research Article
1
- 10.48084/etasr.13116
- Dec 8, 2025
- Engineering, Technology & Applied Science Research
Visual Simultaneous Localization and Mapping (SLAM) has become a cornerstone in the development of intelligent systems capable of perceiving and interacting with their environment in real time. This survey presents a comprehensive review of recent advances in visual SLAM algorithms, with a focus on their classification, performance characteristics, and application domains. This study categorizes existing methods into monocular, stereo, RGB-D, and multi-sensor/hybrid approaches, analyzing key contributions such as ORB-SLAM, DSO, ElasticFusion, and VINS-Mono. Each class is evaluated in terms of accuracy, robustness, and computational efficiency while highlighting the trade-offs associated with different sensor modalities. Additionally, this study explores cross-modal and deep learning-based hybrid SLAM systems, which incorporate semantic understanding, motion segmentation, and sensor fusion to enhance performance in complex and dynamic environments. Application areas, including robotics, augmented/virtual reality, 3D mapping, and wearable technologies, are discussed to underscore the practical relevance of visual SLAM. Finally, the survey outlines the main challenges and future directions, including lifelong mapping, real-time performance on edge devices, semantic integration, and the emergence of SLAM 2.0 systems. This work aims to serve as a resource for researchers and practitioners seeking to understand the state of the art and guide future innovation in the field of visual SLAM.
- Research Article
1
- 10.3390/aerospace12070642
- Jul 20, 2025
- Aerospace
This paper presents a visual active SLAM method considering measurement and state uncertainty for space exploration in urban search and rescue environments. An uncertainty evaluation method based on the Fisher Information Matrix (FIM) is studied from the perspective of evaluating the localization uncertainty of SLAM systems. With the aid of the FIM, the Cramér–Rao Lower Bound (CRLB) of the pose uncertainty in the stereo visual SLAM system is derived to describe the boundary of the pose uncertainty. Optimality criteria are introduced to quantitatively evaluate the localization uncertainty. An odometry information selection method and a local bundle adjustment information selection method based on Fisher information are proposed to identify the measurements with low uncertainty for localization and mapping in the search and rescue process. With these methods, the computing efficiency of the system is improved while the localization accuracy is equivalent to that of the classical ORB-SLAM2. Moreover, using the quantified uncertainty of local poses and map points, a generalized unary node and a generalized unary edge are defined to improve the computational efficiency of computing local state uncertainty. In addition, an active loop closing planner considering local state uncertainty is proposed to make use of uncertainty in assisting the space exploration and decision-making of the MAV, which helps improve MAV localization performance in search and rescue environments. Simulations and field tests in different challenging scenarios are conducted to verify the effectiveness of the proposed method.
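For orientation, the standard relations behind this kind of uncertainty evaluation can be written out. The block below states textbook forms only, assuming a Gaussian measurement model; the symbols $h$, $J$, $\Sigma$ and $P$ are notation introduced here, and the abstract does not say which optimality criteria the paper actually uses.

```latex
% Fisher information for a measurement model z = h(x) + v, with v ~ N(0, \Sigma)
\[
  \mathcal{I}(x) = J^{\top} \Sigma^{-1} J,
  \qquad J = \frac{\partial h}{\partial x}
\]

% Cramér–Rao lower bound for any unbiased estimator \hat{x}
\[
  \operatorname{Cov}(\hat{x}) \succeq \mathcal{I}(x)^{-1}
\]

% Common scalar optimality criteria on the bound P = \mathcal{I}(x)^{-1}
\[
  \text{A-optimality: } \operatorname{tr}(P), \qquad
  \text{D-optimality: } \det(P), \qquad
  \text{E-optimality: } \lambda_{\max}(P)
\]
```

Each criterion reduces the covariance bound to a single number, which is what makes it possible to rank candidate measurements or candidate paths by expected localization uncertainty.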
- Research Article
5
- 10.3390/app142210727
- Nov 20, 2024
- Applied Sciences
This work presents a novel RGB-D dynamic simultaneous localization and mapping (SLAM) method that uses deep learning to improve the accuracy, stability, and efficiency of localization in dynamic environments, in contrast to traditional visual SLAM methods that assume a static scene. Based on the classic framework of traditional visual SLAM, we replace the traditional feature extraction method with a convolutional neural network, aiming to enhance the accuracy of feature extraction and localization and to improve the algorithm’s ability to capture and represent the characteristics of the entire scene. Subsequently, the semantic segmentation thread uses a target detection network combined with geometric methods to identify potential dynamic areas in the image and generate masks for dynamic objects. Finally, the standard deviation of the depth of potential dynamic points is calculated to identify true dynamic feature points, so that only static feature points are used for position estimation. We performed experiments on public datasets to validate the feasibility of the proposed algorithm. The experimental results indicate that the improved SLAM algorithm reduces absolute trajectory error (ATE) by approximately 97% compared to traditional static visual SLAM and by about 20% compared to traditional dynamic visual SLAM, and reduces computation time by 68% compared to well-known dynamic visual SLAM, giving it clear advantages in both positioning accuracy and operational efficiency.
- Research Article
3
- 10.1371/journal.pone.0312358
- Dec 30, 2024
- PLOS One
Accurate localization is a critical technology for the application of intelligent robots and automation systems in complex indoor environments. Traditional visual SLAM (Simultaneous Localization and Mapping) techniques often face challenges with localization accuracy in high-similarity scenes. To address this issue, this paper proposes an improved visual SLAM loop closure detection algorithm that integrates deep learning techniques. Using the TUM f3 loh, Lip6 Indoor, and Bicocca Indoor datasets as experimental bases, a detailed comparison of the proposed algorithm against other methods was conducted across various evaluation metrics. The experimental results show that the proposed loop closure detection algorithm significantly outperforms traditional methods in terms of localization accuracy in high-similarity scenes. Specifically, the detection accuracy rates for the TUM f3 loh, Lip6 Indoor, and Bicocca Indoor datasets were 66.67%, 72.72%, and 80.00%, respectively, representing an approximate 18% improvement over the average accuracy of ORB-SLAM2. Additionally, the proposed method performed well in trajectory error, with a root mean square error (RMSE) of just 0.0816 m on the Bicocca Indoor dataset, significantly lower than the 0.1341 m RMSE of ORB-SLAM2. Furthermore, improvements in feature extraction and matching mechanisms greatly reduced the occurrence of mismatches, enhancing the system's adaptability for more accurate localization and navigation in complex indoor environments. The proposed method effectively enhances localization accuracy and system practicality in visually similar indoor environments, offering a new direction for the development of visual SLAM technology and holding significant application potential in intelligent robots and indoor navigation systems.
- Research Article
6
- 10.3233/jifs-222199
- Nov 9, 2022
- Journal of Intelligent & Fuzzy Systems
The gas gathering pipeline network system is an important process facility for gas field production, responsible for collecting, transporting and purifying the natural gas produced by wells. In this paper, an optimization model for the layout of a star-tree gas gathering pipeline network in discrete space is established to find the most economical design scheme. The decision variables include valve set position, station position and pipeline connection relation. A series of equality and inequality constraints are developed, including node flow balance constraints, pipeline hydraulic constraints and pipeline structure constraints. A global optimization strategy is proposed and an improved genetic algorithm is used to solve the model. To verify the validity of the proposed method, the optimization model is applied to a coalbed methane field gathering pipeline network in China. The results show that the global optimization scheme saves 1489.74×10^4 RMB (26.36%) in investment cost compared with the original scheme. In addition, the comparison between the global and hierarchical optimization schemes shows that the investment cost of the global optimization scheme is 567.22×10^4 RMB less than that of the hierarchical optimization scheme, which further supports the superiority of the global optimization method. Finally, this study can provide theoretical guidance for the design and planning of gas field gathering pipeline networks.
- Conference Article
7
- 10.1145/3410463.3414636
- Sep 30, 2020
This paper proposes a methodology for exploiting approximate computing to reduce the time and energy requirements of Simultaneous Localization and Mapping (SLAM) algorithms, which are used in important problem domains like robotics and autonomous driving in which autonomous agents navigate through unknown environments. Algorithms for SLAM use sensors to probe the environment, integrate this information into a map of the surroundings (mapping), and determine where the agent is in this map (localization). Visual SLAM algorithms use cameras as sensors. They can be used in places where GPS information is not available, such as inside buildings, but they have high computational requirements, leading to poor performance and high energy usage on embedded platforms. Existing studies of approximation in SLAM have mostly used offline control, which requires the trajectory be known before the agent starts to move. This is not realistic in most SLAM applications. In this paper, we present a general methodology for applying principled online approximation to visual SLAM algorithms. We implemented our proposed methodology in four visual SLAM algorithms (including one visual inertial SLAM algorithm) and evaluated them on several platforms. Our experimental results show that across different algorithms and platforms, our proposed methodology results in savings of up to 77% and 40% in computation time and energy consumption respectively with acceptable quality loss in localization and mapping accuracy over a variety of inputs.
- Research Article
43
- 10.1109/access.2020.2994348
- Jan 1, 2020
- IEEE Access
Simultaneous Localization and Mapping (SLAM) has been widely applied in computer vision and robotics. In dynamic environments, which are very common in the real world, traditional visual SLAM systems suffer a significant drop in localization and mapping accuracy due to the static world assumption. Recently, semantic visual SLAM systems for dynamic scenes have attracted more and more attention; they use the semantic information of images to help remove dynamic feature points. Existing semantic visual SLAM systems commonly detect the dynamic feature points by a semantic prior, a geometric constraint, or a combination of the two, and then remove the map points corresponding to dynamic feature points. In the visual SLAM framework, pose calculation essentially revolves around the 3D map points, so the key to improving the accuracy of a visual SLAM system is to build a more accurate and reliable map. These existing semantic visual SLAM systems therefore acquire reliable map points only indirectly, and several drawbacks exist. In this paper, we present SDF-SLAM: Semantic Depth Filter SLAM, a visual semantic SLAM system for dynamic environments, which uses a depth filter to directly judge whether a 3D map point is dynamic or not. First, the semantic information is integrated into the original pure geometry SLAM system by the semantic optical flow method to perform reliable map initialization. Second, a semantic depth filter that satisfies a Gaussian-Uniform mixture distribution is designed to describe the inverse depth of each map point. Third, the inverse depth of each 3D map point is updated in a Bayesian estimation framework, and the map points are divided into active and inactive ones. Last, only the active map points are used to achieve robust camera pose tracking. Experiments on the TUM dataset demonstrate that our approach outperforms the original ORB-SLAM2 and other state-of-the-art semantic SLAM systems.
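A Gaussian-Uniform mixture treats each inverse-depth measurement as either an inlier, scattered around the current estimate, or an outlier, spread uniformly over the valid range. The sketch below shows only that measurement model and the resulting inlier probability. It is an illustration under assumptions, not the SDF-SLAM filter: the function names and all numeric values are hypothetical, and the full Bayesian update of the estimate and the inlier ratio is omitted.

```python
import math


def gaussian_pdf(x, mean, sigma):
    """Density of N(mean, sigma^2) evaluated at x."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def inlier_responsibility(x, mean, sigma, inlier_ratio, d_min, d_max):
    """Posterior probability that measurement x is an inlier under the mixture
    inlier_ratio * N(x | mean, sigma^2) + (1 - inlier_ratio) * U(x | d_min, d_max)."""
    inlier = inlier_ratio * gaussian_pdf(x, mean, sigma)
    in_range = d_min <= x <= d_max
    outlier = (1.0 - inlier_ratio) / (d_max - d_min) if in_range else 0.0
    total = inlier + outlier
    return inlier / total if total > 0.0 else 0.0


# Hypothetical map point: inverse depth estimate 0.5 (1/m), std 0.05, valid range [0.1, 2.0].
consistent = inlier_responsibility(0.5, 0.5, 0.05, 0.7, 0.1, 2.0)
inconsistent = inlier_responsibility(1.5, 0.5, 0.05, 0.7, 0.1, 2.0)
print(consistent, inconsistent)
```

A point whose measurements keep receiving low inlier probability is a natural candidate for the inactive set, which is the kind of decision the abstract describes.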
- Research Article
35
- 10.1007/s10514-023-10110-y
- Jul 4, 2023
- Autonomous Robots
In this research, we propose a stereo visual simultaneous localisation and mapping (SLAM) system that works efficiently in agricultural scenarios without compromising performance or accuracy relative to other state-of-the-art methods. The proposed system is equipped with an image enhancement technique for recovering ORB point and LSD line features, which enables it to work in a broader range of scenarios and to extract extensive spatial information from low-light and hazy agricultural environments. Firstly, the method is tested on the standard KITTI and EuRoC datasets to validate the localisation accuracy by comparing it with other state-of-the-art methods, namely VINS-SLAM, PL-SLAM, and ORB-SLAM2. The experimental results show that the proposed method obtains better localisation and mapping accuracy than the other visual SLAM methods. Secondly, the proposed method is tested on the ROSARIO dataset, our low-light agricultural dataset, and the O-HAZE dataset to validate the performance in agricultural environments. While other methods fail to operate in such complex agricultural environments, our method operates successfully with high localisation and mapping accuracy.
- Research Article
- 10.1177/09544070251349360
- Aug 12, 2025
- Proceedings of the Institution of Mechanical Engineers, Part D: Journal of Automobile Engineering
Visual SLAM can provide localization and mapping for intelligent vehicles. However, in practical applications, complex environments significantly affect localization accuracy and real-time performance. To address these challenges, this paper proposes a dynamic scene visual SLAM method that fuses image segmentation and feature following. The method uses an object detection model to quickly identify dynamic regions within the image and actively removes unstable dynamic feature points. By employing a clustering algorithm to partition map points and efficiently extract feature-rich regions, the approach avoids interference from dynamic objects. Simultaneously, it uses an optical flow algorithm to continuously track static feature-rich regions in subsequent images, mitigating the influence of dynamic pixels and achieving efficient feature point extraction. Experimental results demonstrate that in dynamic environments, the proposed method improves localization accuracy by 70% and reduces feature extraction time per frame by 45% compared to the ORB-SLAM2 system. In most cases, our method also outperforms DS-SLAM, which indicates that the proposed method can effectively improve the localization accuracy and the real-time performance of feature extraction in dynamic scenes for visual SLAM.
- Research Article
17
- 10.1109/tmech.2021.3085326
- Jun 1, 2022
- IEEE/ASME Transactions on Mechatronics
Indirect visual simultaneous localization and mapping (VSLAM) is widely used in robot localization and navigation, thanks to its potential to achieve high localization accuracy from local feature observations. However, existing local features are subject to drift and mismatches under various visual conditions, which degrades localization accuracy and causes tracking loss. This article proposes a quantized self-supervised local feature for indirect VSLAM to handle environmental interference in robot localization tasks. A joint feature detection and description network is built in a lightweight manner to extract local features in real time. The network is iteratively trained by a self-supervised learning strategy, and the extracted local features are quantized by an orthogonal transformation for efficiency. We use frame-wise matching in Hamming space and bundle adjustment to establish a parallel indirect VSLAM. The proposed VSLAM demonstrates outstanding localization accuracy and tracking stability in the evaluation on multiple datasets, and robustness in real-world experiments with the Realsense D435 RGB-D sensor. The efficiency experiment on Jetson TX2 indicates that the quantized self-supervised local feature is suitable for feature-based tasks on edge computing platforms.
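Matching in Hamming space means comparing binary descriptors by the number of differing bits, which is cheap on edge hardware. The following is a minimal brute-force sketch, not the article's matcher: descriptors are represented as Python integers, and the 8-bit toy descriptors, the function names, and the distance threshold are assumptions made for illustration.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two binary descriptors differ."""
    return bin(a ^ b).count("1")


def match_descriptors(query, train, max_distance=2):
    """Brute-force nearest neighbour in Hamming space.

    Returns (query_index, train_index, distance) for every query descriptor
    whose best match is within max_distance (a hypothetical threshold).
    """
    if not train:
        return []
    matches = []
    for qi, q in enumerate(query):
        best_j, best_d = min(
            ((j, hamming(q, t)) for j, t in enumerate(train)),
            key=lambda pair: pair[1],
        )
        if best_d <= max_distance:
            matches.append((qi, best_j, best_d))
    return matches


# Toy 8-bit descriptors from two frames.
frame_a = [0b10110010, 0b00001111]
frame_b = [0b00001110, 0b10110011, 0b11110000]
matches = match_descriptors(frame_a, frame_b)
print(matches)
```

Real systems typically use descriptors of a few hundred bits and add a ratio or cross-check test, but the distance computation stays the same XOR-and-count operation.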
- Conference Article
19
- 10.1109/icra.2014.6907415
- May 1, 2014
This paper reports on a method for an autonomous underwater vehicle to perform real-time visual simultaneous localization and mapping (SLAM) on large ship hulls over multiple sessions. Along with a monocular camera, our method uses a piecewise-planar model to explicitly optimize the ship hull surface in our factor-graph framework, and anchor nodes to co-register multiple surveys. To enable real-time performance for long-term SLAM, we use the recent Generic Linear Constraints (GLC) framework to sparsify our factor-graph. This paper analyzes how our single-session SLAM techniques can be used in the GLC framework, and describes a particle filter reacquisition algorithm so that an underwater session can be automatically re-localized to a previously built SLAM graph. We provide real-world experimental results involving automated ship hull inspection, and show that our localization filter out-performs Fast Appearance-Based Mapping (FAB-MAP), a popular place-recognition system. Using our approach, we can automatically align surveys that were taken days, months, and even years apart.
- Research Article
- 10.1371/journal.pone.0328052
- Jul 18, 2025
- PLOS One
Visual Simultaneous Localization and Mapping (VSLAM) is the key technology for autonomous navigation of mobile robots. However, feature-based VSLAM systems still face two major challenges in dynamic complex environments: insufficient feature reliability and significant dynamic interference, urgently requiring improved matching robustness. This paper innovatively proposes a dynamic adaptive VSLAM system based on the High-repeatability and High-reliability feature matching network (2HR-Net), which improves localization accuracy in dynamic environments through three key innovations: First, the 2HR feature detection network is designed, integrating the K-Means clustering algorithm into L2-Net to achieve feature point detection with both high repeatability and high reliability. Second, the lightweight YOLOv8n model is integrated to detect and remove feature points in dynamic regions in real-time, effectively reducing the impact of dynamic interference on pose estimation. Finally, the shared matching Siamese network with a unique dual-branch feature fusion strategy and similarity optimization algorithm is proposed to enhance the accuracy of feature matching. The proposed algorithm was ultimately validated using the publicly available TUM dataset. The experimental results show that the feature detection method proposed in this paper achieved a repeatability rate of approximately 70% in various dynamic scenarios, which is significantly higher than traditional methods (such as ORB-SLAM3), whose repeatability typically falls below 40%. In addition, compared with ORB-SLAM3, the root mean square error (RMSE) and standard deviation (S.D.) of the Absolute Trajectory Error (ATE) in various dynamic scenarios were reduced by approximately 90%, indicating higher localization accuracy and stability. 
Therefore, the experimental results demonstrate that the proposed method outperforms mainstream methods such as ORB-SLAM3 in terms of feature repeatability, matching accuracy, and localization precision, providing an effective solution for robust VSLAM in dynamic environments.
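The RMSE and standard deviation of the Absolute Trajectory Error quoted above can be computed from per-pose position errors. The sketch below assumes the estimated and ground-truth trajectories are already time-associated and expressed in the same frame, so the alignment step used by standard evaluation tools is omitted. The function name and the toy trajectories are hypothetical.

```python
import math


def ate_statistics(estimated, ground_truth):
    """RMSE and (population) standard deviation of per-pose position errors.

    Both inputs are equal-length lists of (x, y, z) positions that are assumed
    to be already associated in time and aligned to a common frame.
    """
    errors = [math.dist(e, g) for e, g in zip(estimated, ground_truth)]
    n = len(errors)
    mean = sum(errors) / n
    rmse = math.sqrt(sum(err * err for err in errors) / n)
    std = math.sqrt(sum((err - mean) ** 2 for err in errors) / n)
    return rmse, std


# Toy trajectories: per-pose errors are 0.0, 0.5 and 0.5 metres.
truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
estimate = [(0.0, 0.0, 0.0), (1.0, 0.3, 0.4), (2.0, 0.0, 0.5)]
rmse, std = ate_statistics(estimate, truth)
print(rmse, std)
```

RMSE penalises large individual errors more heavily than the mean does, while the standard deviation indicates how evenly the error is spread along the trajectory, which is why the two are usually reported together.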