Abstract

In feature-based visual localization for small-scale scenes, local descriptors are used to estimate the camera pose of a query image. For large and ambiguous environments, learning-based hierarchical networks have been introduced that employ global as well as local descriptors to reduce the search space of database images to a smaller set of reference views. However, since global descriptors are generated from visual features, reference images that share only some of these features may be erroneously selected. To address this limitation, this paper proposes two clustering methods based on how often features reoccur and on their covisibility. In both approaches, the scene is represented by voxels whose size and number are computed from the extent of the scene and the number of available 3D points. In the first approach, a voxel-based histogram representing highly reoccurring scene regions is generated from the reference images, and mean shift is then employed to group the most highly reoccurring voxels into place clusters based on their spatial proximity. In the second approach, a graph representing the covisibility-based relationships between voxels is built. Local matching is performed within the reference-image clusters, and a perspective-n-point (PnP) algorithm is employed to estimate the camera pose. Experimental results show that camera pose estimation using the proposed approaches is more accurate than that of previous methods.
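As a rough illustration of the first clustering approach, the following minimal Python sketch voxelizes the scene's 3D points, counts how many reference images observe each voxel, and groups the most highly reoccurring voxels into place clusters with mean shift. This is not the paper's implementation: the function names, the voxel size, the occurrence threshold, and the mean-shift bandwidth are all illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import MeanShift

def voxel_key(point, voxel_size):
    """Map a 3D point to the integer index of its containing voxel."""
    return tuple(np.floor(point / voxel_size).astype(int))

def place_clusters(points_per_image, voxel_size=1.0,
                   occurrence_threshold=5, bandwidth=3.0):
    """points_per_image: list of (N_i, 3) arrays, the 3D points seen in
    each reference image. Returns voxel centers and their cluster labels."""
    # Voxel-based histogram: count, for each voxel, the number of
    # reference images in which it appears (one vote per image).
    counts = {}
    for pts in points_per_image:
        seen = {voxel_key(p, voxel_size) for p in pts}
        for key in seen:
            counts[key] = counts.get(key, 0) + 1

    # Keep only the highly reoccurring voxels and recover their centers.
    hot = np.array([k for k, c in counts.items()
                    if c >= occurrence_threshold], dtype=float)
    centers = (hot + 0.5) * voxel_size

    # Mean shift groups the hot voxels into place clusters
    # based on their spatial proximity.
    labels = MeanShift(bandwidth=bandwidth).fit_predict(centers)
    return centers, labels

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two synthetic scene regions, each observed by ten reference images.
    imgs = [rng.normal(loc=(0, 0, 0), scale=0.5, size=(50, 3)) for _ in range(10)]
    imgs += [rng.normal(loc=(10, 0, 0), scale=0.5, size=(50, 3)) for _ in range(10)]
    centers, labels = place_clusters(imgs)
    print(f"{len(set(labels))} place clusters from {len(centers)} hot voxels")
```

Counting each voxel at most once per image makes the histogram reflect how many reference views observe a region rather than the raw density of 3D points, which is closer in spirit to the reoccurrence criterion described in the abstract.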
