Visual place recognition is a fundamental component of autonomous systems and robotics, yet it remains challenging in the real world under viewpoint differences and appearance changes. Existing approaches to this problem mainly rely on CNN-based architectures, which struggle to model global correlations. Recently, a few works have exploited the Transformer's effectiveness in modeling long-range dependencies, but this strategy ignores local interactions and thus fails to localize the truly important regions. To address these issues, this paper proposes an effective Transformer-based architecture that takes full advantage of the Transformer's strengths in global context modeling and in capturing specific local regions. We first design a dual-level Transformer descriptor encoder that successively performs self-attention within local windows and over the global extent of the CNN feature map, yielding multi-scale spatial context that combines local interactions with global information. Specifically, classification tokens from multiple layers of the Transformer encoder are integrated to form the global image representation. Moreover, a Transformer-guided geometric verification module is introduced that leverages the hierarchical Transformer's inherent self-attention to fuse multi-level attention maps, which are used to filter the output tokens and obtain key patches with associated attention weights for spatial matching. Finally, we propose a descriptor refinement strategy that employs fine-grained region-level supervision to further strengthen the network's ability to learn locally discriminative features, effectively alleviating the ambiguity caused by weak image-level labels. Extensive experiments on benchmark datasets show that our approach outperforms state-of-the-art methods.
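To make the dual-level idea concrete, the following is a minimal PyTorch sketch (not the authors' implementation) of a block that first applies self-attention within non-overlapping local windows of a CNN feature map and then over the full token set, so each token carries both local interactions and global context. All module names, the window size, and other hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DualLevelAttention(nn.Module):
    """Sketch: window-level then global self-attention over CNN feature tokens."""

    def __init__(self, dim=256, heads=8, window=7):
        super().__init__()
        self.window = window
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        # x: CNN feature map of shape (B, C, H, W); this simplified sketch
        # assumes H and W are divisible by the window size.
        b, c, h, w = x.shape
        ws = self.window
        # --- local level: self-attention inside each ws x ws window ---
        t = x.view(b, c, h // ws, ws, w // ws, ws)
        t = t.permute(0, 2, 4, 3, 5, 1).reshape(-1, ws * ws, c)
        u = self.norm1(t)
        t = t + self.local_attn(u, u, u, need_weights=False)[0]
        # restore a flat (B, H*W, C) token layout for the global level
        t = t.view(b, h // ws, w // ws, ws, ws, c)
        t = t.permute(0, 1, 3, 2, 4, 5).reshape(b, h * w, c)
        # --- global level: self-attention over all H*W tokens ---
        u = self.norm2(t)
        t = t + self.global_attn(u, u, u, need_weights=False)[0]
        return t  # tokens carrying both local and global context

tokens = DualLevelAttention()(torch.randn(2, 256, 14, 14))
print(tokens.shape)  # torch.Size([2, 196, 256])
```

In a full encoder of this kind, classification tokens collected from multiple such layers would be fused into the global descriptor, and the per-layer attention weights would serve as the cue for selecting key patches during geometric verification, as the abstract describes.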