Abstract

Although depth estimation is a key technology for three-dimensional sensing applications involving motion, active sensors such as LiDAR and depth cameras tend to be expensive and bulky. Here, we explore the potential of monocular depth estimation (MDE) using a self-supervised approach. MDE is a promising technology, but supervised learning depends on accurate ground-truth depth data. Recent studies have enabled self-supervised training of an MDE model with only monocular image sequences and image-reconstruction errors. We pretrained networks using multiple datasets, including monocular and stereo image sequences. The main challenges for self-supervised MDE were occlusions and dynamic objects. To handle these problems, we proposed two novel loss functions, min-over-all and min-with-flow, both based on the per-pixel minimum reprojection error of Monodepth2 and extended to stereo images and optical flow. With extensive pretraining and the novel losses, our model outperformed existing unsupervised approaches in quantitative depth estimation and in the ability to distinguish small objects against a background, as evaluated on KITTI 2015.
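
The per-pixel minimum reprojection error that both proposed losses build on can be illustrated with a short sketch. This is not the authors' implementation: it covers only the Monodepth2-style base idea (take the smallest reconstruction error across source frames at each pixel, rather than the average), it uses a plain L1 error where Monodepth2 combines L1 with SSIM, and the function name `per_pixel_min_reprojection_loss` is hypothetical. The stereo and optical-flow extensions (min-over-all, min-with-flow) are not specified in this abstract and are not reproduced here.

```python
import numpy as np

def per_pixel_min_reprojection_loss(target, reconstructions):
    """Sketch of a per-pixel minimum reprojection loss (L1 only, an assumption).

    target:          (H, W) or (H, W, C) target image.
    reconstructions: list of arrays with the same shape as `target`, each one
                     the target view synthesized from a different source frame.
    Returns (scalar loss, per-pixel minimum error map of shape (H, W)).
    """
    target = np.asarray(target, dtype=np.float64)
    errors = []
    for recon in reconstructions:
        err = np.abs(np.asarray(recon, dtype=np.float64) - target)
        if err.ndim == 3:
            err = err.mean(axis=-1)  # average over color channels
        errors.append(err)
    errors = np.stack(errors, axis=0)  # (num_sources, H, W)
    # A pixel occluded in one source frame is often visible in another,
    # so the minimum ignores the source with the large, misleading error.
    min_error = errors.min(axis=0)
    return min_error.mean(), min_error

# Toy example: each source frame reconstructs only half of the pixels well.
target = np.zeros((2, 2))
from_prev = np.array([[0.0, 1.0], [1.0, 0.0]])
from_next = np.array([[1.0, 0.0], [0.0, 1.0]])
loss, per_pixel = per_pixel_min_reprojection_loss(target, [from_prev, from_next])
print(loss)  # 0.0: every pixel is matched by at least one source frame
```

Averaging the two error maps in the toy example would give 0.5 instead, which is the kind of penalty on occluded pixels that the minimum is meant to avoid.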

Highlights

  • Three-dimensional (3D) vision involves inferring 3D geometric information from two-dimensional (2D) images

  • As depth information is important to moving vehicles, we used driving datasets such as KITTI, Cityscapes, Waymo and A2D2 [1]–[4]

  • With a convolutional neural network (CNN), numerous kernels are automatically adjusted for accurate depth prediction, and less pre- and postprocessing and regularization are required


Summary

Introduction

Three-dimensional (3D) vision involves inferring 3D geometric information from two-dimensional (2D) images. Monocular depth estimation (MDE) produces a dense depth map from a single image. The weights to be multiplied are optimized to predict true depths during training. Their results are relatively inaccurate, with average relative depth error rates greater than 30%. With a CNN, numerous kernels are automatically adjusted for accurate depth prediction, and less pre- and postprocessing and regularization are required. While these methods produce the most accurate results [7], [9], with a relative depth error rate below 10%, they require depth-labeled datasets. The core of self-supervision is photometric loss, which pairs temporally adjacent or stereo images and synthesizes one image from the other using an estimated depth map and the relative pose between them. The difference between the synthesized and original images represents the depth and pose estimation error.
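
The view-synthesis step behind the photometric loss can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: it assumes a pinhole camera with intrinsics `K`, a 4x4 rigid transform `T_target_to_source`, grayscale images, and nearest-neighbor sampling (training pipelines typically use differentiable bilinear sampling instead). The function names are hypothetical.

```python
import numpy as np

def synthesize_target_view(source, depth, K, T_target_to_source):
    """Reconstruct the target image by sampling the source image.

    source: (H, W) grayscale source image.
    depth:  (H, W) estimated depth of the target image.
    K:      (3, 3) pinhole intrinsics (assumed shared by both views).
    T_target_to_source: (4, 4) transform taking target-camera points
                        into the source-camera frame.
    Returns (synthesized image, boolean mask of validly sampled pixels).
    """
    H, W = depth.shape
    v, u = np.mgrid[0:H, 0:W]  # v = row index, u = column index
    pix = np.stack([u.ravel(), v.ravel(), np.ones(H * W)], axis=0)  # (3, N)
    # Back-project each target pixel to a 3D point using its depth.
    cam = (np.linalg.inv(K) @ pix) * depth.ravel()
    cam_h = np.vstack([cam, np.ones((1, H * W))])
    # Move the points into the source camera frame and project them.
    src = (T_target_to_source @ cam_h)[:3]
    z = src[2]
    in_front = z > 1e-6
    safe_z = np.where(in_front, z, 1.0)  # avoid division by zero
    proj = K @ src
    us = np.rint(proj[0] / safe_z).astype(int)
    vs = np.rint(proj[1] / safe_z).astype(int)
    valid = in_front & (us >= 0) & (us < W) & (vs >= 0) & (vs < H)
    synth = np.zeros(H * W, dtype=np.float64)
    synth[valid] = source[vs[valid], us[valid]]
    return synth.reshape(H, W), valid.reshape(H, W)

def photometric_error(target, synthesized, valid):
    """Mean absolute difference over validly sampled pixels (L1 only)."""
    return np.abs(target - synthesized)[valid].mean()
```

If the depth map and pose are correct and the scene is static and unoccluded, the synthesized image matches the target and the error is near zero; errors in depth or pose shift the sampling locations and raise it. Note that `photometric_error` returns NaN when no pixel is valid, so a caller would need to guard against that case.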

