Abstract

This study addresses the challenge of visual localization from monocular images, a key capability that underpins navigation and interaction in autonomous systems. With the advent of deep learning, visual localization techniques built on these methods have demonstrated improved robustness across diverse environments. Existing end-to-end models apply convolutional neural networks (CNNs) to extract salient features and directly regress continuous spatial poses from implicitly learned, differentiable map representations. Nonetheless, these models often fail to adapt their feature representations to extreme variations in environmental conditions, leading to critical localization errors under changing illumination, varying weather, or moving objects. To overcome these limitations, we introduce the end-to-end feature refinement network for visual localization (EFRNet-VL). This architecture is designed to prioritize the extraction of static features crucial for six-degrees-of-freedom (6DoF) pose estimation, thereby outperforming prior methods. EFRNet-VL integrates a convolutional backbone with self-attention mechanisms and Long Short-Term Memory (LSTM) modules, which together enable the accurate association of a single image with its corresponding camera pose, even in dynamic environments. The proposed feature refinement approach is straightforward to implement and can enhance the performance of existing neural pose estimators. Our comprehensive evaluations of EFRNet-VL underscore its effectiveness. Notably, it reduces the average position and orientation errors by 54.5% and 25.7%, respectively, compared with the widely used PoseNet model across various indoor settings. Moreover, in large-scale outdoor environments, it achieves an average localization accuracy of 7.02 m / 2.79°. EFRNet-VL sets a new benchmark for end-to-end learning-based visual localization and operates in real time, processing each image frame in 9.8 ms.
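
To make the described architecture concrete, the sketch below shows how a CNN backbone, a self-attention stage, and an LSTM can be combined into a single-image 6DoF pose regressor in PyTorch. This is an illustrative assumption based only on the abstract, not the authors' published EFRNet-VL implementation; the ResNet-18 backbone, layer sizes, and the position-plus-quaternion pose parameterization are hypothetical choices.

```python
# Minimal sketch of a CNN + self-attention + LSTM pose regressor (assumed design,
# not the EFRNet-VL reference code).
import torch
import torch.nn as nn
from torchvision.models import resnet18


class PoseRegressor(nn.Module):
    def __init__(self, feat_dim=512, lstm_hidden=256, num_heads=8):
        super().__init__()
        # CNN backbone: keep everything up to (but not including) the global
        # pooling and classification head, so a spatial feature map remains.
        backbone = resnet18(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])
        # Self-attention over the flattened spatial locations, intended to
        # emphasize static scene structure over transient content.
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        # LSTM over the spatial token sequence to aggregate refined features.
        self.lstm = nn.LSTM(feat_dim, lstm_hidden, batch_first=True)
        # Separate heads for position (x, y, z) and orientation (quaternion).
        self.fc_xyz = nn.Linear(lstm_hidden, 3)
        self.fc_quat = nn.Linear(lstm_hidden, 4)

    def forward(self, img):                       # img: (B, 3, H, W)
        fmap = self.cnn(img)                      # (B, 512, h, w)
        tokens = fmap.flatten(2).transpose(1, 2)  # (B, h*w, 512)
        refined, _ = self.attn(tokens, tokens, tokens)
        _, (hidden, _) = self.lstm(refined)       # hidden: (1, B, lstm_hidden)
        feat = hidden[-1]
        quat = self.fc_quat(feat)
        quat = quat / quat.norm(dim=-1, keepdim=True)  # unit quaternion
        return self.fc_xyz(feat), quat


# Usage: one RGB frame in, a 3-vector position and unit quaternion out.
pose_xyz, pose_quat = PoseRegressor()(torch.randn(1, 3, 224, 224))
```

Normalizing the quaternion head keeps the orientation output on the unit sphere, a common choice in PoseNet-style regressors; whether EFRNet-VL uses this parameterization is not stated in the abstract.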
