Abstract
Visual positioning is the task of determining the location at which a given image was taken, and it is necessary for augmented reality applications. Traditional algorithms solve this problem by matching the image against prebuilt 3D point clouds or panoramic images. Recently, more attention has been given to models that match the ground-level image with overhead imagery. In this paper, we introduce AlignNet, which builds upon previous work to bridge the gap between ground-level and top-level images. Drawing on multiple key insights, we achieve up to 4 times higher recall rates on a visual positioning dataset. For this matching, we fuse satellite imagery with map data from OpenStreetMap, extending the previously available satellite database with the corresponding map data. The model passes the input images through a two-branch U-Net and matches them using a geometric projection module that maps the top-level image to the ground-level domain at a given position. Because the difference between the projection and the ground-level image is calculated in a differentiable fashion, a Levenberg–Marquardt (LM) module can iteratively move the estimated position towards the ground-truth position. This sample-wise optimization strategy allows the model to align the position better than if it had to obtain the location in a single step. We provide key insights into the model’s behavior, which allow us to obtain competitive results on the KITTI cross-view dataset. We compare our results with the state of the art and obtain new best results in 3 of the 9 categories we consider, including a 57% likelihood of lateral localization within 1 m in a 40 m × 40 m area and a 93% likelihood of azimuth localization within 3° when using a 20° rotation noise prior.
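To make the iterative alignment idea concrete, the following is a minimal sketch of a damped Levenberg–Marquardt loop on a toy problem. It is not AlignNet's LM module: the residual here is a plain 2D point-set difference rather than a difference between projected overhead features and ground-level features, the pose is assumed to be (lateral shift, longitudinal shift, azimuth), the damping is held fixed for brevity, and all function names are hypothetical.

```python
import numpy as np

def rotation(theta):
    """2D rotation matrix for an azimuth angle in radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def residuals(pose, src, tgt):
    """Flattened difference between the transformed source and the target."""
    tx, ty, theta = pose
    moved = src @ rotation(theta).T + np.array([tx, ty])
    return (moved - tgt).ravel()

def jacobian(pose, src):
    """Analytic Jacobian of the residuals w.r.t. (tx, ty, theta)."""
    c, s = np.cos(pose[2]), np.sin(pose[2])
    J = np.zeros((2 * len(src), 3))
    J[0::2, 0] = 1.0                               # d(x residual)/d(tx)
    J[1::2, 1] = 1.0                               # d(y residual)/d(ty)
    J[0::2, 2] = -s * src[:, 0] - c * src[:, 1]    # d(x residual)/d(theta)
    J[1::2, 2] = c * src[:, 0] - s * src[:, 1]     # d(y residual)/d(theta)
    return J

def lm_align(src, tgt, pose0, damping=1e-2, iters=20):
    """Refine the pose with fixed-damping LM steps (assumed toy setting)."""
    pose = np.asarray(pose0, dtype=float)
    for _ in range(iters):
        r = residuals(pose, src, tgt)
        J = jacobian(pose, src)
        # LM update: solve (J^T J + damping * I) step = -J^T r
        H = J.T @ J + damping * np.eye(pose.size)
        pose = pose + np.linalg.solve(H, -J.T @ r)
    return pose

# Toy data: four points moved by a known pose that the loop should recover.
TRUE_POSE = np.array([1.0, -2.0, 0.2])
src = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
tgt = src @ rotation(TRUE_POSE[2]).T + TRUE_POSE[:2]
est = lm_align(src, tgt, pose0=[0.0, 0.0, 0.0])
print(est)
```

Each iteration refines the current estimate rather than predicting the pose in one shot, which is the property the abstract attributes to the sample-wise optimization strategy. Practical LM implementations usually adapt the damping between iterations instead of keeping it fixed.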