Abstract

Detailed three-dimensional measurements and reconstructions of the world are fundamentally and immediately useful in professions such as engineering, architecture, medicine, computer graphics, and art. A three-dimensional understanding of the world in which we exist provides rich cues that inform a bevy of essential tasks in computer vision and robotics, such as object recognition, tracking, visual perception, navigation, object picking, and manipulation.

The problem of reconstructing or mapping the world from images fundamentally involves the path taken by light – from the world, through a lens, to the imaging sensor. This problem has been approached in a number of distinct ways – including Multi-View Stereo (MVS), Structure from Motion (SfM), and Simultaneous Localisation and Mapping (SLAM) – each with its own priorities and assumptions. The existence of these more focused fields, however, has created a false dichotomy: that real-time algorithms are for ordered images, and batch methods are for unordered images. In fact, in the age of high-quality personal devices, there exists an entire space between, in which video data is available without real-time constraints. Beyond merely making the reconstruction problem easier, it is at this previously unexplored eutectic point that this thesis advances the capabilities of dense, handheld, object-centric reconstruction using monocular video.

Recent disruptions in SLAM and visual odometry have led to a deeper understanding of the use of features (indirect) versus raw pixels (direct) to align views, and consequently to the gradual erosion of the perceived shortcomings of direct methods. By additionally solving for the orientation of surface patches and for the camera's intrinsics, including a radial distortion model, we demonstrate that direct methods can also be effectively applied to uncalibrated SfM problems. Borrowing from SLAM, the small motions between video frames allow efficient initialisation of view poses and scene structure for new keyframes without the feature matching required for larger baselines. We show that our direct SfM pipeline is over twice as accurate as the state-of-the-art feature-based pipeline, COLMAP.

Given the ability to accurately track a camera's pose from handheld video, obtaining a dense reconstruction becomes an MVS problem. Existing algorithms are designed for a sparse set of high-resolution images, fusing dense stereo depth maps and minimising photometric reprojection errors. Without an extraordinarily complex model of light and material physics, these algorithms can never recover the surface of non-Lambertian materials. We investigate how the evolution of occlusion boundaries in a video sequence can reveal the shape of the underlying surface. We propose an approach for reconstructing glass and metal surfaces, whose density is dependent on the spatio-angular resolution of the views – rather than the image resolution.

The absolute scale of a dense reconstruction remains the one unobservable property with a monocular camera. Methods designed for real-time robotics use additional sensors, such as GPS or an accelerometer, to update pose and scale estimates on the fly. In contrast, we assert that in the absence of real-time constraints it is best to rely on vision for fully reconstructing the scene up to scale, using additional sensors to estimate only this last remaining value. With the goal of achieving this indoors, we do so with a single accelerometer.

In unifying the work in this thesis, we demonstrate the ability to take dense 3D measurements of challenging objects to an accuracy of better than 0.2 mm.
