TransBoNet: Learning camera localization with Transformer Bottleneck and Attention

Xiaogang Song, Hongjuan Li, Li Liang, Weiwei Shi, Guo Xie, Xiaofeng Lu, Xinhong Hei

doi:10.1016/j.patcog.2023.109975

Abstract

6DoF camera localization is an important component of autonomous driving and navigation. Deep learning has achieved impressive results in localization, but its robustness in dynamic environments has not been adequately addressed. In this paper, we propose a framework based on a hybrid attention mechanism that can be applied generally to existing CNN-based pose regressors to improve their robustness in dynamic environments. Specifically, we propose a novel Transformer Bottleneck (TBo) block comprising convolution, channel attention, and a position-aware self-attention mechanism, which extracts more geometrically robust features by capturing long-term dependencies between pixels. Furthermore, we introduce shuffle attention (SA) before the pose regressor, which integrates feature information in both the spatial and channel dimensions, forcing the network to learn geometrically robust features and reducing the effects of dynamic objects and illumination conditions, thereby improving camera localization accuracy. We evaluate our method on commonly benchmarked indoor and outdoor datasets, and the experimental results show that the proposed method significantly improves localization performance and compares favorably with contemporary pose regression schemes. In addition, extensive ablation evaluations are conducted to demonstrate the effectiveness of the proposed hybrid attention bottleneck block for pose regression networks.
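To make the two attention ingredients named in the abstract more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a parameter-free sigmoid gate as a stand-in for channel attention and a single-head self-attention whose logits add a content-position term to the usual content-content term. The function names, tensor shapes, random weights, and the `pos` embedding table are all hypothetical, and the convolution, residual connections, and shuffle attention stages are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def channel_attention(feat):
    # feat: (C, H, W). Assumed stand-in: gate each channel by the sigmoid
    # of its global average; a learned gate would normally be used instead.
    pooled = feat.mean(axis=(1, 2))            # (C,)
    gate = 1.0 / (1.0 + np.exp(-pooled))       # (C,), values in (0, 1)
    return feat * gate[:, None, None]


def position_aware_self_attention(feat, wq, wk, wv, pos):
    # feat: (C, H, W); wq, wk, wv: (C, D) projection matrices;
    # pos: (H*W, D) hypothetical position embeddings, one per pixel.
    c, h, w = feat.shape
    tokens = feat.reshape(c, h * w).T          # (N, C), one token per pixel
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    # Content-content logits plus content-position logits.
    logits = (q @ k.T + q @ pos.T) / np.sqrt(q.shape[1])   # (N, N)
    attn = softmax(logits, axis=-1)            # each row sums to 1
    out = attn @ v                             # every pixel mixes all pixels
    return out.T.reshape(-1, h, w), attn


rng = np.random.default_rng(0)
C, H, W, D = 8, 4, 4, 8                        # illustrative sizes only
feat = rng.standard_normal((C, H, W))
wq, wk, wv = (rng.standard_normal((C, D)) * 0.1 for _ in range(3))
pos = rng.standard_normal((H * W, D)) * 0.1

gated = channel_attention(feat)
out, attn = position_aware_self_attention(gated, wq, wk, wv, pos)
```

Because every row of `attn` spans all `H*W` positions, each output pixel can draw on distant pixels in a single step, which is the sense in which self-attention captures dependencies that a small convolution kernel cannot reach.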

Full Text