Abstract

Accurate pose estimation of monocular ground-level images with respect to a satellite/aerial photogrammetric dataset is an extremely challenging task. Existing solutions often perform an offline post-registration of the 3D results from both sources, which suffers from non-rigid geometric distortions in the monocular 3D reconstruction and from the limited overlap between aerial and ground content. This paper presents an online solution that accurately estimates the poses of ground images with respect to a 3D model derived from satellite images, followed by a dense 3D reconstruction. Our solution builds on the simultaneous localization and mapping (SLAM) paradigm to dynamically incorporate reference observations from the satellite 3D model during incremental pose estimation. This cross-view SLAM solver jointly minimizes a ground-to-satellite error and image-level reprojection errors at the frame level, yielding image poses that are well registered to the satellite 3D model for facade point cloud reconstruction. The process also corrects the non-rigid distortions and trajectory drifts commonly present in monocular SLAM systems. In addition, our solution exploits both the geometric and semantic information of the satellite model and the ground images to perform a per-frame correction for frame-level pose initialization, in which a novel scheme called the pose buffer initializes the pose of each keyframe through robust visual hull alignment of ground objects. The proposed approach was evaluated on four trajectories of monocular video (around 7,000 frames per trajectory on average) and a 3D semantic model derived from multi-view satellite images, estimating the poses of the video frames and producing point clouds consistent with the satellite 3D model; accuracy was assessed against LiDAR ground truth. Both qualitative and quantitative experiments demonstrate that our solution yields accurate, drift-free poses and point clouds consistent with the satellite data, as well as visually much more pleasing 3D models with facade information. Compared to the LiDAR ground truth, the 3D models derived with ground-level images achieve a mean absolute error of 1.78 m, improved from the 3.15 m achieved by SLAM without the satellite 3D model. A testing program will be made available at https://github.com/GDAOSU/Cross-View-SLAM.
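To make the frame-level objective concrete, the sketch below illustrates one plausible form of the joint cost the abstract describes: image-level reprojection residuals combined with a weighted ground-to-satellite term. This is a minimal illustration only, not the paper's implementation; the function names (`project`, `joint_cost`), the choice of a point-to-plane distance against the satellite-derived surface, and the weight `w_sat` are all assumptions introduced here for clarity.

```python
import numpy as np

def project(K, R, t, X):
    """Pinhole projection of 3D points X (N,3) into pixel coordinates (N,2)."""
    Xc = (R @ X.T + t.reshape(3, 1)).T   # transform points into the camera frame
    uv = (K @ Xc.T).T                    # homogeneous pixel coordinates
    return uv[:, :2] / uv[:, 2:3]        # perspective division

def joint_cost(K, R, t, pts3d, obs2d, sat_pts, sat_normals, w_sat=1.0):
    """Hypothetical frame-level cost combining image reprojection error with a
    ground-to-satellite term, as described qualitatively in the abstract.
    sat_pts/sat_normals are the nearest satellite-model surface points and
    normals for each reconstructed point (an assumed data association)."""
    # Image-level reprojection residuals (pixels)
    r_reproj = project(K, R, t, pts3d) - obs2d
    # Ground-to-satellite residuals: signed point-to-plane distance of each
    # reconstructed point to the satellite-derived surface (assumed form)
    r_sat = np.einsum('ij,ij->i', pts3d - sat_pts, sat_normals)
    return np.sum(r_reproj**2) + w_sat * np.sum(r_sat**2)
```

Minimizing such a cost over the per-frame pose (R, t), e.g. with a nonlinear least-squares solver, would anchor the incremental SLAM trajectory to the satellite model while still honoring the image observations, which is consistent with the drift-correction behavior the abstract reports.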
