TECD_Attention: Texture-enhanced and cross-domain attention modeling for visual place recognition

Zhenyu Li, Zhenbiao Dong

doi:10.1016/j.cviu.2024.103929

Abstract

Visual place recognition (VPR) is a challenging visual computing task in robot navigation. Most existing methods, which rely on plain CNN or Transformer features, fail to learn the most salient features of place images because of the inconsistency problem common in VPR datasets, and this limits the robustness and interpretability of the models. In addition, existing state-of-the-art methods capture only general features of the original places with multi-scale CNN or Transformer features and ignore the texture characteristics present in place images, resulting in suboptimal recognition performance. To address these issues, we propose a novel visual place recognition network, named Texture-enhanced Cross-domain Attention Transformer (TECD_Attention). Specifically, a cross-attention Transformer is first used to fuse deep attentive local and global features, improving the multi-scale feature representation of the recognition model. Second, a texture-enhanced cross-domain attention block is designed to construct the final feature descriptor by fusing texture features with the attentive local–global features. Then, a triplet loss function is used to match top-ranked reference places from the place database to a query place. Last, effective and efficient place re-ranking is achieved by training an adapted weakly supervised re-ranking network that relies on the similarity computed between the query place and the top-ranked places. We evaluate our approach in extensive experiments on four challenging datasets. Our model achieves 96.2%, 94.6%, 95.9%, and 96.8% average recall in the top 1% candidate scenario on the Tokyo 24/7, Pitts250k, VPRiCE, and SUN397 datasets, respectively. Compared with existing state-of-the-art VPR methods, TECD_Attention achieves superior performance, and we conclude that it is a robust model for robot visual place recognition in challenging environments.
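
As an illustration of two quantities named in the abstract, the following minimal sketch shows a triplet loss over descriptor vectors and a recall measure over the top 1% of database candidates. It is not the authors' implementation: the squared Euclidean distance, the margin value, and the function names `triplet_loss` and `recall_at_top_percent` are assumptions made for this example only.

```python
import numpy as np


def triplet_loss(query, positive, negative, margin=0.1):
    """Hinge triplet loss on squared Euclidean distances, averaged over the batch.

    The distance choice and the margin value are assumptions for illustration.
    """
    d_pos = np.sum((query - positive) ** 2, axis=1)
    d_neg = np.sum((query - negative) ** 2, axis=1)
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))


def recall_at_top_percent(query_desc, db_desc, positives, percent=1.0):
    """Fraction of queries with a true match among the top `percent`% candidates.

    `positives[i]` lists the database indices that count as correct for query i.
    """
    # Number of candidates to keep; at least one even for a tiny database.
    k = max(1, int(round(len(db_desc) * percent / 100.0)))
    # Pairwise squared distances, shape (num_queries, num_database).
    diffs = query_desc[:, None, :] - db_desc[None, :, :]
    dists = np.sum(diffs ** 2, axis=2)
    # Indices of the k nearest database descriptors for each query.
    top_k = np.argsort(dists, axis=1)[:, :k]
    hits = [bool(set(row.tolist()) & set(pos)) for row, pos in zip(top_k, positives)]
    return float(np.mean(hits))
```

In this sketch the loss falls to zero once each negative is farther from the query than the positive by at least the margin, and the recall counts a query as correct if any of its true matches appears among the retained candidates.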

Full Text