Abstract

Understanding 3D scene geometry is a fundamental research topic in computer vision, encompassing subproblems such as depth prediction, visual odometry, and optical flow. With the advent of deep learning, many approaches have emerged that address these problems in an end-to-end manner. Such pipelines cast the 3D understanding task as a nonlinear optimization problem and minimize a cost function over the whole framework. Here, we present a self-supervised framework for jointly learning monocular depth and camera ego-motion from unlabeled, unstructured, monocular video sequences. We propose a forward-backward consistency constraint on view reconstruction that captures temporal relations across adjacent frames and fully exploits the bidirectional projection information. We further propose a simple and practical improvement to the cost function design that increases estimation accuracy. Because this improvement is a lightweight, general module, it can be integrated seamlessly into any self-supervised architecture to yield more accurate results. Evaluation on the KITTI dataset demonstrates that our approach is highly efficient and outperforms existing methods in pose estimation, while achieving results comparable to existing work in depth estimation.
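For illustration only, since the abstract does not give the exact formulation, a forward-backward (bidirectional) view-reconstruction objective of the kind described above can be sketched as follows. The notation is assumed rather than taken from the paper: $K$ denotes the camera intrinsics, $D_t$ and $D_{t+1}$ the predicted depth maps of adjacent frames $I_t$ and $I_{t+1}$, $T_{t\to t+1}$ the predicted relative camera pose, $\pi(\cdot)$ the perspective projection, and $p$ a pixel in homogeneous coordinates:
\[
\mathcal{L}_{\mathrm{fb}}
  = \sum_{p}\bigl\lVert I_t(p) - \hat{I}_t(p) \bigr\rVert_1
  + \sum_{p}\bigl\lVert I_{t+1}(p) - \hat{I}_{t+1}(p) \bigr\rVert_1,
\]
where the two reconstructions use the forward and backward warps, respectively:
\[
\hat{I}_t(p) = I_{t+1}\!\bigl(\pi\bigl(K\,T_{t\to t+1}\,D_t(p)\,K^{-1}p\bigr)\bigr),
\qquad
\hat{I}_{t+1}(p) = I_{t}\!\bigl(\pi\bigl(K\,T_{t\to t+1}^{-1}\,D_{t+1}(p)\,K^{-1}p\bigr)\bigr).
\]
Coupling both terms to a single predicted pose (used once forward and once inverted) is one way such a constraint can tie the two warping directions together across adjacent frames; the paper's actual loss may differ in its photometric terms and regularization.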
