Abstract

Simultaneous Localization and Mapping (SLAM) is a technique for simultaneously mapping an environment and estimating an agent's location and pose within it; visual SLAM accomplishes this using image data from a camera. Markerless visual SLAM systems that rely on natural image features and multiple-view-geometry methods require an initialization procedure, which typically applies structure-from-motion techniques to solve for the first pair of keyframes and then triangulates the initial set of 3-space points from those keyframes. Single-camera (monocular) systems must select suitable frames for the structure-from-motion process from the camera's pre-initialization motion, but this frame selection problem is non-trivial: frame suitability is largely determined by the baseline length between the camera poses and by the coplanarity of the scene points, both of which are unknown prior to initialization. This work presents a set of logistic regression and neural network models that select good frames for the structure-from-motion process using a small set of summary data derived from the matched correspondences. The models are trained on 14 sequences from the TUM RGB-D dataset (totaling over 23,000 samples) without the need for any ground-truth pose information. The resulting models demonstrate significantly higher precision and recall than previous state-of-the-art frame selection techniques while still running in real time and functioning for both coplanar and non-coplanar scene points.
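The sketch below illustrates the general shape of such a classifier, assuming an illustrative feature set (match count and pixel-disparity statistics of the correspondences); the abstract does not specify the paper's actual features, labels, or model hyperparameters, and the training data here is synthetic, so this is a minimal sketch rather than the authors' implementation.

```python
# Minimal sketch: logistic-regression frame selection from correspondence
# summary data. Feature choices and the synthetic training set are
# illustrative assumptions, not the paper's actual pipeline.
import numpy as np
from sklearn.linear_model import LogisticRegression


def summary_features(pts_a: np.ndarray, pts_b: np.ndarray) -> np.ndarray:
    """Summarize matched keypoints, given as (N, 2) pixel coordinates
    from a candidate frame pair."""
    disp = np.linalg.norm(pts_b - pts_a, axis=1)  # per-match pixel motion
    return np.array([len(disp), disp.mean(), disp.std(), np.median(disp)])


# Synthetic stand-in for training data: in practice, X would be built from
# candidate frame pairs in sequences such as TUM RGB-D, and y would record
# whether structure-from-motion initialization succeeded from that pair
# (a label that needs no ground-truth poses).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 1] > 0).astype(int)  # toy label: "enough parallax"

clf = LogisticRegression().fit(X, y)

# At runtime, attempt initialization only when the classifier is confident.
pts_a = rng.uniform(0, 640, size=(80, 2))
pts_b = pts_a + rng.normal(5.0, 1.0, size=(80, 2))  # simulated frame motion
feats = summary_features(pts_a, pts_b)[None, :]
if clf.predict_proba(feats)[0, 1] > 0.5:
    print("pair accepted for structure-from-motion initialization")
else:
    print("pair rejected; keep tracking")
```

Because the features are a handful of scalar statistics rather than the raw correspondences, evaluating such a classifier per candidate pair is cheap, which is consistent with the abstract's real-time claim.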
