Abstract

Image retrieval algorithms are widely used in visual localization tasks. In visual localization, it is beneficial to retrieve images that depict the same landmark and were taken from a pose similar to the query. However, state-of-the-art image retrieval algorithms are optimized mainly for landmark retrieval and do not take camera pose into account. To address this limitation, we propose a novel Depth Attention Network (DeAttNet). DeAttNet leverages both visual and depth information to learn a global image representation. Depth varies for similar features captured from different camera poses; based on this insight, we employ depth within an attention mechanism to discern and emphasize salient regions. In our method, we use monocular depth estimation algorithms to render the depth maps. Compared to RGB-only image descriptors, the proposed method obtains significant improvements on the Mapillary Street Level Sequences, Pittsburgh, and Cambridge Landmarks datasets.
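As a rough illustration of the idea described above (and only that; this is not the authors' DeAttNet architecture), the following minimal PyTorch sketch shows one way a monocular depth map could drive an attention mask over backbone features to produce a global descriptor. The module name, the small depth-to-attention convolutional head, and the pooling choice are all hypothetical assumptions introduced here for clarity.

```python
# Illustrative sketch only: NOT the paper's DeAttNet implementation.
# Assumes a backbone producing feature maps of shape (B, C, H, W) and a
# precomputed monocular depth map of shape (B, 1, H', W').
import torch
import torch.nn as nn
import torch.nn.functional as F


class DepthGuidedAttention(nn.Module):
    """Re-weights visual feature maps with attention scores derived from depth."""

    def __init__(self, feat_channels: int):
        super().__init__()
        # Small conv head mapping the single-channel depth map to per-pixel
        # attention logits (a hypothetical design choice, not from the paper).
        self.depth_to_attn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, kernel_size=1),
        )
        self.feat_channels = feat_channels

    def forward(self, feats: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        # Resize the depth map to the spatial resolution of the feature maps.
        depth = F.interpolate(
            depth, size=feats.shape[-2:], mode="bilinear", align_corners=False
        )
        # Per-pixel attention weights in [0, 1] derived from depth.
        attn = torch.sigmoid(self.depth_to_attn(depth))      # (B, 1, H, W)
        # Emphasize regions the depth-based attention deems salient.
        weighted = feats * attn                               # (B, C, H, W)
        # Global average pooling over attended features -> global descriptor.
        desc = weighted.mean(dim=(-2, -1))                    # (B, C)
        return F.normalize(desc, dim=-1)                      # L2-normalized


# Toy usage with random tensors standing in for backbone features and depth.
feats = torch.randn(2, 256, 24, 32)
depth = torch.rand(2, 1, 192, 256)
descriptor = DepthGuidedAttention(feat_channels=256)(feats, depth)  # (2, 256)
```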
