Contemporary state-of-the-art localization methods either match features against a structured scene model or learn to regress 3D scene coordinates. The resulting matches between 2D query pixels and 3D scene coordinates are used to estimate the camera pose with PnP and RANSAC, which requires the camera intrinsics of both the query and reference images. An alternative approach is to regress the camera pose directly from the query image. Although less accurate, absolute camera pose regression requires no additional information at inference time and is typically lightweight and fast. Recently, Transformers were proposed for learning multi-scene camera pose regression, employing encoders to attend to spatially varying deep features while using decoders to embed multiple scene queries at once. In this work, we show that Transformer encoders can aggregate and extract task-informative latent representations for learning both single- and multi-scene camera pose regression, without Transformer decoders. Our approach is shown to reduce the runtime and memory of previous Transformer-based multi-scene solutions, while comparing favorably with contemporary pose regression schemes and achieving state-of-the-art accuracy on multiple indoor and outdoor regression benchmarks. In particular, to the best of our knowledge, our approach is the first absolute pose regression approach to attain sub-meter average accuracy across outdoor scenes. We make our code publicly available at: https://github.com/yolish/transposenet.
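To make the accuracy claim concrete, the following is a minimal sketch of how the error of a regressed pose is commonly measured: the Euclidean distance between estimated and ground-truth positions, and the angle between estimated and ground-truth orientations. It assumes positions are given in meters and orientations as quaternions in (w, x, y, z) order; the function names are hypothetical and this is not the authors' evaluation code.

```python
import math


def position_error(p_est, p_gt):
    """Euclidean distance between two 3D positions (same units as the inputs)."""
    return math.dist(p_est, p_gt)


def orientation_error_deg(q_est, q_gt):
    """Angle in degrees between two orientations given as quaternions (w, x, y, z)."""
    # Normalize so the inputs need not be exact unit quaternions.
    n_est = math.sqrt(sum(c * c for c in q_est))
    n_gt = math.sqrt(sum(c * c for c in q_gt))
    dot = sum(a * b for a, b in zip(q_est, q_gt)) / (n_est * n_gt)
    # q and -q encode the same rotation, so take |dot|; clamp against rounding.
    dot = min(1.0, abs(dot))
    return math.degrees(2.0 * math.acos(dot))
```

Under these assumptions, "sub-meter accuracy" means that `position_error` stays below 1.0 when aggregated over the test images of a scene; the aggregation itself (mean or median) is left to the caller.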