3D perception is critical to intelligent VR/AR and autonomous vehicle (AV) applications and is attracting significant attention. Self-supervised monocular depth and ego-motion estimation is a learning approach that provides the scene depth and location required for 3D perception. However, existing self-supervised methods suffer from scale ambiguity, boundary blur, and imbalanced depth distribution, which limits their practical use in VR/AR and AV applications. In this paper, we propose a new self-supervised learning framework based on superpixel and normal constraints to address these problems. Specifically, we formulate a novel 3D edge structure consistency loss to alleviate boundary blur in depth estimation. To resolve the scale ambiguity of the estimated depth and ego-motion, we propose a novel surface normal network for efficient camera height estimation. The surface normal network is composed of a deep fusion module and a full-scale hierarchical feature aggregation module. In addition, to achieve global smoothness and boundary discriminability in the predicted normal map, we introduce a novel fusion loss based on consistency constraints on the normals in edge domains and superpixel regions. Experiments on several benchmarks show that the proposed approach outperforms state-of-the-art methods in depth, ego-motion, and surface normal estimation.