Accurate localization of pedestrians and mobile robots is critical for navigation, emergency response, and autonomous driving. Traditional localization methods, such as those based on satellite signals, often prove ineffective in certain environments, and acquiring sufficient positional data can be challenging. Single-image localization techniques have been developed to address these issues. However, current deep learning frameworks for single-image localization that rely on domain adaptation fail to make effective use of the semantically rich high-level features obtained from large-scale pretraining. This paper introduces a novel framework that leverages the Contrastive Language-Image Pre-training (CLIP) model and prompts to enhance feature extraction and domain adaptation through semantic information. The proposed framework generates an integrated score map from scene-specific prompts to guide feature extraction and employs adversarial components to facilitate domain adaptation. Furthermore, a reslink component is incorporated to mitigate the loss of precision in high-level features relative to the original data. Experimental results demonstrate that the use of prompts reduces localization errors by 26.4% in indoor environments and 24.3% in outdoor settings. The model achieves localization errors as low as 0.75 m in position and 8.09 degrees in orientation indoors, and 4.56 m and 7.68 degrees outdoors. Analysis of prompts from labeled datasets confirms the model’s capability to effectively interpret scene information. The weights of the integrated score map enhance the model’s transparency, thereby improving interpretability. This study underscores the efficacy of integrating semantic information into image localization tasks.