Abstract

Absolute camera pose regressors estimate the position and orientation of a camera given the captured image alone. Typically, a convolutional backbone with a multi-layer perceptron (MLP) head is trained using images and pose labels to embed a single reference scene at a time. Recently, this scheme was extended to learn multiple scenes by replacing the MLP head with a set of fully connected layers. In this work, we propose to learn multi-scene absolute camera pose regression with Transformers, where encoders aggregate activation maps with self-attention and decoders transform latent features and scene encodings into pose predictions. This allows our model to focus on general features that are informative for localization, while embedding multiple scenes in parallel. We extend our previous MS-Transformer approach [1] by introducing a mixed classification-regression architecture that improves localization accuracy. Our method is evaluated on commonly benchmarked indoor and outdoor datasets and is shown to outperform both multi-scene and state-of-the-art single-scene absolute pose regressors. We make our code publicly available.
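
The following is a minimal sketch, not the authors' implementation, of the kind of multi-scene Transformer pose head the abstract describes: an encoder aggregates the flattened CNN activation map with self-attention, a decoder attends from learned per-scene queries, and a mixed classification-regression output selects the scene and regresses its pose. All hyper-parameters (embedding dimension, layer counts) and names are illustrative assumptions.

```python
# Hypothetical multi-scene Transformer pose head (PyTorch), for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScenePoseTransformer(nn.Module):
    def __init__(self, num_scenes: int, dim: int = 256):
        super().__init__()
        # One learned query per scene; the decoder attends from these queries
        # to the encoded activation map.
        self.scene_queries = nn.Embedding(num_scenes, dim)
        self.transformer = nn.Transformer(
            d_model=dim, nhead=8,
            num_encoder_layers=6, num_decoder_layers=6,
            batch_first=True,
        )
        # Classification head: which scene the image was captured in.
        self.scene_cls = nn.Linear(dim, num_scenes)
        # Regression heads: 3D position and 4D orientation (unit quaternion).
        self.pos_head = nn.Linear(dim, 3)
        self.rot_head = nn.Linear(dim, 4)

    def forward(self, feats: torch.Tensor):
        # feats: flattened backbone activation map, shape (B, H*W, dim).
        batch = feats.size(0)
        queries = self.scene_queries.weight.unsqueeze(0).expand(batch, -1, -1)
        latent = self.transformer(feats, queries)          # (B, num_scenes, dim)
        scene_logits = self.scene_cls(latent.mean(dim=1))  # (B, num_scenes)
        # Pick the latent of the most likely scene and regress its pose.
        idx = scene_logits.argmax(dim=-1)
        selected = latent[torch.arange(batch), idx]        # (B, dim)
        position = self.pos_head(selected)                 # (B, 3)
        orientation = F.normalize(self.rot_head(selected), dim=-1)  # unit quaternion
        return scene_logits, position, orientation
```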
