Abstract

Robust estimation of correspondences between image pixels is an important problem in robotics, with applications in tracking, mapping, and recognition of objects, environments, and other agents. Correspondence estimation has long been the domain of hand-engineered features, but more recently deep learning techniques have provided powerful tools for learning features from raw data. The drawback of the latter approach is that a vast amount of training data, typically labeled, is required for learning. This paper advocates a new approach to learning visual descriptors for dense correspondence estimation in which we harness the power of a strong three-dimensional generative model to automatically label correspondences in RGB-D video data. A fully convolutional network is trained using a contrastive loss to produce viewpoint- and lighting-invariant descriptors. As a proof of concept, we collected two datasets: The first depicts the upper torso and head of the same person in widely varied settings, and the second depicts an office as seen on multiple days with objects rearranged within. Our datasets focus on revisitation of the same objects and environments, and we show that by training the CNN only from local tracking data, our learned visual descriptor generalizes to identifying unlabeled correspondences across videos. We furthermore show that our approach to descriptor learning can be used to achieve state-of-the-art single-frame localization results on the MSR 7-scenes dataset without using any labels identifying correspondences between separate videos of the same scenes at training time.
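The abstract describes training a fully convolutional network with a contrastive loss over pixel pairs so that corresponding pixels map to nearby descriptors and non-corresponding pixels map to distant ones. The sketch below is a minimal, illustrative PyTorch formulation of such a pixelwise contrastive loss; it is not the authors' code, and the function name, tensor shapes, and margin value are assumptions made for the example.

```python
# Minimal sketch (not the authors' implementation) of a pixelwise contrastive
# loss over sampled pixel pairs, assuming descriptors have already been
# gathered from the FCN output at N pixel locations in each image.
import torch
import torch.nn.functional as F

def pixelwise_contrastive_loss(desc_a, desc_b, is_match, margin=0.5):
    """desc_a, desc_b: (N, C) descriptors sampled at paired pixel locations.
    is_match: (N,) float tensor, 1.0 for corresponding pairs, 0.0 otherwise.
    margin: hinge margin for non-matches (value chosen for illustration)."""
    d = F.pairwise_distance(desc_a, desc_b)                     # Euclidean distance per pair
    match_loss = is_match * d.pow(2)                            # pull matches together
    nonmatch_loss = (1.0 - is_match) * torch.clamp(margin - d, min=0.0).pow(2)  # push non-matches apart up to the margin
    return (match_loss + nonmatch_loss).mean()
```

In this formulation the supervision signal (which pixel pairs are matches) would come from the 3-D model-based tracking described in the abstract rather than from manual labels.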
