Abstract

The success of monocular depth estimation relies on large and diverse training sets. Due to the challenges associated with acquiring dense ground-truth depth across different environments at scale, a number of datasets with distinct characteristics and biases have emerged. We develop tools that enable mixing multiple datasets during training, even if their annotations are incompatible. In particular, we propose a robust training objective that is invariant to changes in depth range and scale, advocate the use of principled multi-objective learning to combine data from different sources, and highlight the importance of pretraining encoders on auxiliary tasks. Armed with these tools, we experiment with five diverse training datasets, including a new, massive data source: 3D films. To demonstrate the generalization power of our approach, we use zero-shot cross-dataset transfer, i.e., we evaluate on datasets that were not seen during training. The experiments confirm that mixing data from complementary sources greatly improves monocular depth estimation. Our approach clearly outperforms competing methods across diverse datasets, setting a new state of the art for monocular depth estimation.
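To make the scale- and shift-invariant objective concrete, here is a minimal PyTorch sketch of the underlying idea: before the loss is computed, the prediction is aligned to the ground truth by a closed-form least-squares fit of a per-image scale s and shift t. The function names, tensor layout, and plain MSE penalty are our own illustrative choices; the paper's actual objective (e.g., its choice of output space and robust penalty) is specified in the full text.

```python
import torch

def align_scale_shift(pred, target, mask):
    """Per-image closed-form least-squares fit of scale s and shift t,
    minimizing sum_i mask_i * (s * pred_i + t - target_i) ** 2.
    pred, target, mask: (B, N) tensors; mask = 1 where ground truth is valid."""
    a_00 = (mask * pred * pred).sum(dim=1)   # entries of the 2x2 normal equations
    a_01 = (mask * pred).sum(dim=1)
    a_11 = mask.sum(dim=1)
    b_0 = (mask * pred * target).sum(dim=1)
    b_1 = (mask * target).sum(dim=1)
    det = (a_00 * a_11 - a_01 * a_01).clamp(min=1e-6)  # >= 0 by Cauchy-Schwarz
    s = (a_11 * b_0 - a_01 * b_1) / det
    t = (a_00 * b_1 - a_01 * b_0) / det
    return s.unsqueeze(1) * pred + t.unsqueeze(1)

def scale_shift_invariant_mse(pred, target, mask):
    """Mean squared error on the aligned prediction, over valid pixels only."""
    aligned = align_scale_shift(pred, target, mask)
    n = mask.sum(dim=1).clamp(min=1)
    return ((mask * (aligned - target) ** 2).sum(dim=1) / n).mean()
```

Because s and t are re-estimated for every training sample, ground truth expressed in incompatible ranges and scales (e.g., metric depth vs. disparity from 3D films, known only up to an unknown scale and shift) can supervise the same network.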

Highlights

  • Depth is among the most useful intermediate representations for action in physical environments [1]

  • We investigate ways to train robust monocular depth estimation models that are expected to perform well across diverse environments

  • We argue that high-capacity deep models for monocular depth estimation can in principle operate on a fairly wide and unconstrained range of scenes

Summary

INTRODUCTION

Depth is among the most useful intermediate representations for action in physical environments [1]. Training monocular depth estimators, however, requires large quantities of dense ground-truth depth, which is hard to acquire across diverse environments at scale. Stereo cameras are a promising source of data [9], [10], but collecting suitable stereo images in diverse environments at scale remains a challenge. Structure-from-motion (SfM) reconstruction has been used to construct training data for monocular depth estimation across a variety of scenes [11], but the result does not include independently moving objects and is incomplete due to the limitations of multi-view matching. None of the existing datasets is sufficiently rich to support the training of a model that works robustly on real images of diverse scenes. We investigate ways to train robust monocular depth estimation models that are expected to perform well across diverse environments. Our extensive experiments, which cover approximately six GPU months of computation, show that a model trained on a rich and diverse set of images from different sources, with an appropriate training procedure, delivers state-of-the-art results across a variety of environments.
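As a sketch of how incompatible datasets can be mixed during training, the loop below draws one mini-batch from each source per optimization step and sums the per-dataset scale- and shift-invariant losses. The equal weighting and the (image, target, mask) batch format are assumptions made for illustration; the paper advocates a principled multi-objective weighting of the per-dataset losses rather than this naive sum.

```python
import torch

def endless(loader):
    """Cycle over a finite data loader indefinitely."""
    while True:
        for batch in loader:
            yield batch

def train_mixed(model, optimizer, loaders, loss_fn, num_steps):
    """Naive dataset mixing: one mini-batch per dataset per step,
    per-dataset losses summed with equal weights (an assumption;
    see the lead-in above)."""
    streams = [endless(loader) for loader in loaders]
    for _ in range(num_steps):
        optimizer.zero_grad()
        total = 0.0
        for stream in streams:
            image, target, mask = next(stream)
            pred = model(image).flatten(1)  # flatten to (B, N) for the loss
            total = total + loss_fn(pred, target.flatten(1), mask.flatten(1))
        total.backward()
        optimizer.step()
```

With a loss such as scale_shift_invariant_mse above, each dataset contributes supervision in its own annotation space, and the gradient of the summed objective updates a single shared model.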
