Abstract

Monocular depth estimation (MDE) is a fundamental problem in computer vision. Recently, self-supervised learning (SSL) approaches have attracted significant attention because they can train an MDE network without ground-truth depth data. However, the performance of most existing SSL-MDE methods is still limited by the available real training data, which consist of either binocular stereo pairs or monocular video sequences. In this paper, we propose a simple but effective generalization of the SSL framework so that collections of multi-view Internet photos, a virtually unlimited source of real data, can be used to train an MDE network. By combining depth consistency with a mask that alleviates interference from sources such as moving objects, the network benefits from the real correspondences between adjacent views, which improves its performance. Experiments show that generalizing Monodepth2 with the proposed method not only outperforms the original Monodepth2 and some data-driven MDE methods, but also consistently improves the performance of multiple state-of-the-art SSL-MDE methods. In addition, experiments on SeasonDepth, a dataset covering various environmental conditions, show that the proposed method generalizes well.

