Abstract

We present an occlusion-aware unsupervised neural network for jointly learning three low-level vision tasks from monocular videos: depth, optical flow, and camera motion. The system consists of three predictive sub-networks that are coupled during training by combined loss terms, yet each task can be computed independently on test samples. Geometric constraints derived from scene geometry, which have traditionally been used in bundle adjustment or pose-graph optimization, are formulated as self-supervisory signals in our end-to-end learning approach. Unlike prior work, our image reconstruction loss also takes optical flow into account. Moreover, we impose novel 3D flow consistency constraints over the predictions of all three tasks. By explicitly modeling occlusion and exploiting both 2D and 3D geometric relationships, abundant geometric constraints are imposed on the estimated outputs, enabling the system to capture both low-level representations and high-level cues and to infer thinner scene structures. Empirical evaluation on the KITTI dataset demonstrates the effectiveness of our approach: (1) monocular depth estimation outperforms state-of-the-art unsupervised methods and is comparable to stereo-supervised ones; (2) optical flow prediction ranks at the top among prior works and even beats supervised and traditional methods, especially in non-occluded regions; (3) pose estimation outperforms established SLAM systems by a reasonable margin under comparable input settings.
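To make the flow-aware reconstruction term above concrete, here is a minimal NumPy sketch of a photometric loss that warps one frame into another along a predicted flow field and ignores occluded pixels. It is an illustration only, not the authors' implementation: the function names, the bilinear sampler, and the binary visibility mask are our assumptions.

    import numpy as np

    def bilinear_sample(img, x, y):
        # sample img (H, W, C) at float coordinates with bilinear interpolation
        H, W = img.shape[:2]
        x0 = np.clip(np.floor(x).astype(int), 0, W - 2)
        y0 = np.clip(np.floor(y).astype(int), 0, H - 2)
        dx = np.clip(x - x0, 0.0, 1.0)[..., None]
        dy = np.clip(y - y0, 0.0, 1.0)[..., None]
        tl, tr = img[y0, x0], img[y0, x0 + 1]
        bl, br = img[y0 + 1, x0], img[y0 + 1, x0 + 1]
        return (tl * (1 - dx) * (1 - dy) + tr * dx * (1 - dy)
                + bl * (1 - dx) * dy + br * dx * dy)

    def flow_reconstruction_loss(target, source, flow, visible):
        # warp the source frame into the target frame along the predicted
        # optical flow, then compare photometrically (L1) only where the
        # binary mask marks pixels as visible (non-occluded); the mask
        # input is a hypothetical precomputed quantity
        H, W = target.shape[:2]
        ys, xs = np.mgrid[0:H, 0:W]
        warped = bilinear_sample(source, xs + flow[..., 0], ys + flow[..., 1])
        err = np.abs(target - warped).mean(axis=-1)
        return (err * visible).sum() / (visible.sum() + 1e-8)

In the paper, the occlusion mask itself is derived from bidirectional flow (see "Bidirectional Flow Loss as Occlusion Mask" below).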

Highlights

  • An agent's ability to perceive its visual environment is in greater demand than ever due to the rapid expansion of industries such as robotics [1], autonomous driving [2], and augmented reality [3], which involve many crucial tasks, including reasoning about ego-motion and scene structure

  • Structure from motion (SfM) [4,5,6] is a long-standing computer vision task that aims to reconstruct camera motion and scene structure, but it often struggles to integrate reasonable priors for small camera translations or to handle outliers arising from low-level sparse correspondences

  • The main idea of this paper is to take optical flow into account alongside depth and camera motion estimation, jointly training a network with both 2D and 3D geometric transformations so that additional constraints better exploit the natural geometric structure of the scene


Summary

Introduction

An agent's ability to perceive its visual environment is in greater demand than ever due to the rapid expansion of industries such as robotics [1], autonomous driving [2], and augmented reality [3], which involve many crucial tasks, including reasoning about ego-motion and scene structure. Visual simultaneous localization and mapping (VSLAM) systems [7,8,9] use handcrafted sparse or dense abstractions to represent geometry and are limited to particular kinds of scenes; they suffer from scale ambiguity in the monocular setting and from noisy data when a depth sensor is used. The main idea of this paper is to take optical flow into account alongside depth and camera motion estimation, jointly training a network with both 2D and 3D geometric transformations so that additional constraints better exploit the natural geometric structure of the scene. The authors learn an unsupervised deep neural network from monocular videos that predicts depth, optical flow, and camera motion simultaneously at training time, coupled by a combined loss term. The 3D constraint consists of two parts: a 3D point alignment loss and a novel 3D flow loss (see the sketch below).
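As a rough illustration of these two 3D terms, the sketch below backprojects depth maps to point clouds with the inverse camera intrinsics, applies the predicted rigid motion (R, t), and penalizes (a) the distance between the transformed points and the points observed in the second frame, and (b) the disagreement between the rigid 3D flow and the 3D flow implied by the predicted optical flow. All names, the choice of correspondences, and the precomputed warped depth inputs are assumptions for illustration; consult the paper for the exact formulation.

    import numpy as np

    def backproject(depth, K_inv):
        # lift each pixel (u, v) to a 3D point: X = depth * K^{-1} [u, v, 1]^T
        H, W = depth.shape
        ys, xs = np.mgrid[0:H, 0:W]
        pix = np.stack([xs, ys, np.ones((H, W))], axis=-1)
        return (pix @ K_inv.T) * depth[..., None]

    def backproject_at(depth_at, u, v, K_inv):
        # same lifting, but at arbitrary float pixel coordinates (u, v),
        # with depth_at already sampled at those coordinates
        pix = np.stack([u, v, np.ones_like(u)], axis=-1)
        return (pix @ K_inv.T) * depth_at[..., None]

    def geometric_losses(depth1, depth2_rigid, depth2_flow, flow, R, t, K, mask):
        # depth2_rigid: frame-2 depth sampled where the rigid motion reprojects
        #               each frame-1 pixel (hypothetical precomputed input)
        # depth2_flow:  frame-2 depth sampled where the optical flow points
        K_inv = np.linalg.inv(K)
        H, W = depth1.shape
        ys, xs = np.mgrid[0:H, 0:W]

        P1 = backproject(depth1, K_inv)
        P1_in_2 = P1 @ R.T + t                     # rigid motion applied

        # 3D point alignment: transformed points should coincide with the
        # points observed at their rigid reprojection in frame 2
        proj = P1_in_2 @ K.T
        u_r, v_r = proj[..., 0] / proj[..., 2], proj[..., 1] / proj[..., 2]
        P2_rigid = backproject_at(depth2_rigid, u_r, v_r, K_inv)
        align = np.abs(P1_in_2 - P2_rigid).sum(-1)

        # 3D flow consistency: rigid 3D flow vs. the 3D flow implied by
        # the predicted optical flow and the two depth maps
        P2_flow = backproject_at(depth2_flow, xs + flow[..., 0],
                                 ys + flow[..., 1], K_inv)
        flow3d = np.abs((P1_in_2 - P1) - (P2_flow - P1)).sum(-1)

        m = mask / (mask.sum() + 1e-8)             # weight by visibility mask
        return (align * m).sum(), (flow3d * m).sum()

Sampling the second depth map at non-integer coordinates would use a bilinear sampler like the one in the earlier sketch, and both terms would presumably be evaluated only on non-occluded pixels, consistent with the occlusion mask above.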

Related Work
Deep Learning with Geometry for Scene Understanding
Supervised Videos Learning
Unsupervised Videos Learning
Method
System Overview
Bidirectional Flow Loss as Occlusion Mask
Mutually Supervised 3D Geometric Consistency Loss
Smoothness Constraint
Network Architecture
Training Details
Depth Evaluation
Optical Flow Evaluation
Pose Evaluation
Conclusions
