Abstract

Visual relocalization aims to estimate the absolute camera pose from a single image or a sequence of images. Recent works tackle this problem by exploiting deep neural networks to regress camera poses. However, spatial and temporal cues from sequential images remain underexplored, resulting in inaccurate poses and large outliers. In this work, we introduce a novel vision-Transformer-based absolute pose regression model, TransAPR, to address this problem. On top of a conventional CNN backbone, we design Transformer-based spatial and temporal fusion modules to enable rich feature interaction among neighboring images in the sequence. A hierarchical feature aggregation (HFA) module is further designed to aggregate multi-scale and multi-level features in the pose regressor. Benefiting from these designs, our model generates reliable image representations for absolute pose regression, leading to more robust localization in challenging environments. We conduct extensive experiments on various indoor and outdoor datasets and show that our method achieves state-of-the-art performance.
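
To make the described pipeline concrete, below is a minimal PyTorch sketch of Transformer-based temporal fusion for absolute pose regression: per-frame features from a small CNN backbone are fused across neighboring frames by a Transformer encoder, then regressed to a translation and a quaternion per frame. The class name `TemporalFusionAPR`, the toy backbone, and all dimensions are illustrative assumptions; they are not the authors' exact TransAPR architecture (which also includes a spatial fusion module and the HFA aggregation described above).

```python
# Hypothetical sketch: Transformer-based temporal fusion for absolute pose
# regression over an image sequence. Not the authors' implementation.
import torch
import torch.nn as nn


class TemporalFusionAPR(nn.Module):
    def __init__(self, feat_dim=256, num_heads=8, num_layers=2, seq_len=3):
        super().__init__()
        # Toy CNN backbone: maps each image to a global feature vector.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Learned positional embedding over the temporal axis.
        self.pos_embed = nn.Parameter(torch.zeros(1, seq_len, feat_dim))
        # Transformer encoder fuses features across neighboring frames.
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.temporal_fusion = nn.TransformerEncoder(layer, num_layers)
        # Regress a 3-D translation and a 4-D unit quaternion per frame.
        self.trans_head = nn.Linear(feat_dim, 3)
        self.rot_head = nn.Linear(feat_dim, 4)

    def forward(self, frames):                  # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1)).view(b, t, -1)
        fused = self.temporal_fusion(feats + self.pos_embed[:, :t])
        quat = nn.functional.normalize(self.rot_head(fused), dim=-1)
        return self.trans_head(fused), quat     # (B, T, 3), (B, T, 4)


# Usage: a batch of 2 sequences, each with 3 frames of 128x128 RGB images.
model = TemporalFusionAPR()
xyz, q = model(torch.randn(2, 3, 3, 128, 128))
print(xyz.shape, q.shape)  # torch.Size([2, 3, 3]) torch.Size([2, 3, 4])
```

Self-attention over the temporal axis lets each frame's representation attend to its neighbors, which is one plausible reading of how sequential cues could be fused before pose regression.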
