Abstract

Depth estimation is crucial to understanding the geometry of a scene in robotics and computer vision. Traditionally, depth estimators are trained with various forms of self-supervised stereo data or supervised ground-truth data. Compared to methods that rely on stereo pairs or ground-truth depth from laser scans, estimating depth from an unlabeled monocular camera is considerably more challenging. Recent work has shown that CNN-based depth estimators can be learned from unlabeled monocular video. Without needing stereo data or ground-truth depth, monocular self-supervised strategies can exploit much larger and more varied image datasets. Inspired by recent advances in depth estimation, in this paper we propose a novel objective that replaces explicit ground-truth depth or binocular stereo supervision with unlabeled monocular video sequences. The proposed architecture makes no assumptions about scene geometry and requires no pre-trained information. To improve pose prediction, we propose an improved differentiable direct visual odometry (DDVO) module, fused with an appearance-matching loss. We introduce an auto-masking approach in the DDVO depth predictor to filter out low-texture and occluded regions, which easily cause matching errors between one frame and the next in the monocular sequence. Additionally, we introduce a self-supervised loss function that fuses the auto-masking and depth-prediction components. Our method produces state-of-the-art results for monocular depth estimation on the KITTI driving dataset, outperforming even some supervised methods trained with ground-truth depth.
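To make the abstract's two key ingredients concrete, the sketch below shows, under common self-supervised depth-estimation conventions, how an SSIM + L1 appearance-matching loss and an auto-mask (which keeps only pixels where warping the source frame actually lowers the photometric error) could be combined. This is a minimal illustration in PyTorch, not the authors' implementation; all function and variable names (`ssim`, `appearance_matching_loss`, `automasked_loss`, `warped`, `source`, `target`) are hypothetical.

```python
# Hypothetical sketch of an SSIM + L1 appearance-matching loss with auto-masking,
# in the style commonly used by self-supervised monocular depth methods.
import torch
import torch.nn.functional as F


def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified per-pixel SSIM dissimilarity over 3x3 windows."""
    mu_x = F.avg_pool2d(x, 3, 1, padding=1)
    mu_y = F.avg_pool2d(y, 3, 1, padding=1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, padding=1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, padding=1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, padding=1) - mu_x * mu_y
    ssim_map = ((2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    )
    return torch.clamp((1 - ssim_map) / 2, 0, 1)


def appearance_matching_loss(pred, target, alpha=0.85):
    """Per-pixel photometric error: weighted SSIM + L1 term."""
    l1 = (pred - target).abs().mean(1, keepdim=True)
    return alpha * ssim(pred, target).mean(1, keepdim=True) + (1 - alpha) * l1


def automasked_loss(warped, source, target):
    """Auto-masked photometric loss.

    warped : source frame warped into the target view via predicted depth and pose
    source : unwarped adjacent frame
    target : target frame
    Pixels are kept only where the reprojection error is lower than the error of
    comparing the unwarped frames directly, filtering out low-texture and
    occluded regions that would otherwise corrupt the matching.
    """
    reproj_err = appearance_matching_loss(warped, target)
    identity_err = appearance_matching_loss(source, target)
    mask = (reproj_err < identity_err).float()  # auto-mask
    return (mask * reproj_err).sum() / mask.sum().clamp(min=1.0)
```

In this sketch the mask is recomputed per pixel from the two error maps, so no extra supervision or hand-tuned threshold is needed; only the relative comparison between the warped and unwarped photometric errors drives the filtering.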
