Simultaneous localization and mapping (SLAM) is one of the cornerstones of autonomous navigation systems in robotics and the automotive industry. Visual SLAM (V-SLAM), which relies on image features, such as keypoints and descriptors to estimate the pose transformation between consecutive frames, is a highly efficient and effective approach for gathering environmental information. With the rise of representation learning, feature detectors based on deep neural networks (DNNs) have emerged as an alternative to handcrafted solutions. This work examines the integration of sparse learned features into a state-of-the-art SLAM framework and benchmarks handcrafted and learning-based approaches by comparing the two methods through in-depth experiments. Specifically, we replace the ORB detector and BRIEF descriptor of the ORBSLAM3 pipeline with those provided by Superpoint, a DNN model that jointly computes keypoints and descriptors. Experiments on three publicly available datasets from different application domains were conducted to evaluate the pose estimation performance and resource usage of both solutions.
Read full abstract