Abstract

This paper proposes a novel unsupervised learning framework for depth recovery and camera ego-motion estimation from monocular video. The framework exploits optical flow (OF) properties to jointly train the depth and ego-motion models. Unlike existing unsupervised methods, our method extracts features from the optical flow rather than from the raw RGB images, thereby enhancing unsupervised learning. In addition, we exploit a forward-backward consistency check of the optical flow to generate a mask of invalid regions in the image and accordingly eliminate outlier regions, such as occluded areas and moving objects, from the learning. Furthermore, in addition to using view synthesis as a supervision signal, we impose optical flow consistency loss and depth consistency loss as additional supervision signals on the valid image regions to further enhance the training of the models. Extensive experiments on multiple benchmark datasets demonstrate that our method outperforms other unsupervised methods.
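The forward-backward consistency check described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes dense flow fields stored as (H, W, 2) NumPy arrays, nearest-neighbour sampling, and the commonly used threshold form with illustrative constants `alpha` and `beta`.

```python
import numpy as np

def fb_consistency_mask(flow_fw, flow_bw, alpha=0.01, beta=0.5):
    """Mask out pixels where forward and backward optical flow disagree.

    flow_fw, flow_bw: (H, W, 2) arrays of (dx, dy) displacements.
    A pixel p is valid when flow_bw sampled at p + flow_fw(p) roughly
    cancels flow_fw(p); occlusions and moving objects violate this.
    Threshold: ||f + b||^2 < alpha * (||f||^2 + ||b||^2) + beta.
    """
    h, w = flow_fw.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # Target coordinates after applying the forward flow (clipped to image).
    xt = np.clip(np.round(xs + flow_fw[..., 0]).astype(int), 0, w - 1)
    yt = np.clip(np.round(ys + flow_fw[..., 1]).astype(int), 0, h - 1)
    # Nearest-neighbour sample of the backward flow at the target points.
    bw_warped = flow_bw[yt, xt]
    diff = flow_fw + bw_warped
    sq_diff = np.sum(diff ** 2, axis=-1)
    sq_mag = np.sum(flow_fw ** 2, axis=-1) + np.sum(bw_warped ** 2, axis=-1)
    return sq_diff < alpha * sq_mag + beta
```

Pixels where the mask is False (e.g. regions occluded between the two frames) would then be excluded from the loss computation.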

Highlights

  • Depth recovery and camera ego-motion estimation from monocular video are fundamental topics in computer vision with numerous applications in industry, including robotics, driverless vehicles, and navigation systems

  • By virtue of the optical flow property, the framework extracts features from the optical flow rather than from the raw RGB images, thereby enhancing unsupervised learning

  • We eliminate outlier regions such as occluded areas and moving objects from the learning by generating a mask of invalid regions in the scene according to the forward-backward consistency of the optical flow, thereby preventing the training from being inhibited and improving performance

  • We propose an optical flow consistency loss and a depth consistency loss as additional supervision signals to further enhance the training of the models

  • We conduct extensive experiments on multiple benchmark datasets, and the results demonstrate that our method outperforms existing unsupervised algorithms

  • We evaluated the performance of PoseNet on the official
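The supervision signals listed above can be sketched as a single masked objective. This is a hedged illustration, not the paper's loss: the argument names, the rigid-flow/warped-depth inputs, and the weights `w_flow` and `w_depth` are hypothetical placeholders, and every term is averaged only over the validity mask from the forward-backward consistency check.

```python
import numpy as np

def masked_training_loss(i_target, i_synth, flow_pred, flow_rigid,
                         depth_fw, depth_bw_warped, valid_mask,
                         w_flow=0.1, w_depth=0.1):
    """Sketch of a combined objective with view synthesis plus
    optical-flow and depth consistency terms.

    i_synth: target view re-synthesised from a source frame using the
    predicted depth and ego-motion; flow_rigid: flow induced by that
    depth/pose; depth_bw_warped: depth of the other frame warped into
    this view. valid_mask: forward-backward consistency mask.
    """
    m = valid_mask.astype(np.float64)
    n = m.sum() + 1e-8
    # View-synthesis (photometric) loss on valid pixels only.
    l_photo = (np.abs(i_target - i_synth).mean(axis=-1) * m).sum() / n
    # Optical-flow consistency: predicted flow vs. rigid flow from depth+pose.
    l_flow = (np.linalg.norm(flow_pred - flow_rigid, axis=-1) * m).sum() / n
    # Depth consistency between the two views.
    l_depth = (np.abs(depth_fw - depth_bw_warped) * m).sum() / n
    return l_photo + w_flow * l_flow + w_depth * l_depth
```

In training, masking every term with the same validity mask keeps occlusions and moving objects from pulling the depth and pose networks toward inconsistent geometry.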


Summary

Introduction

Depth recovery and camera ego-motion estimation from monocular video are fundamental topics in computer vision with numerous applications in industry, including robotics, driverless vehicles, and navigation systems. Traditional solutions to these tasks rely on binocular stereo techniques or structure-from-motion methods, which reconstruct scene geometry from correspondences across multiple views. Learning-based methods can be classified into two groups, supervised and unsupervised, in terms of whether they rely on ground truth for training. Supervised methods learn functions that map images to depth and ego-motion by minimizing the differences between the estimated values and the corresponding ground truth [5,6,7,8,9,10,11,12,13,14,15]. Supervised methods need a massive quantity of ground-truth data to train the model, which is both costly and difficult to obtain in practice.

