Abstract

Accurately sensing the surrounding 3D scene is indispensable for drones and robots to perform path planning and navigation. In this paper, a novel monocular depth estimation method is proposed that first uses a lightweight Convolutional Neural Network (CNN) for coarse depth prediction and then refines the coarse depth images under surface normal guidance. Specifically, the coarse depth prediction network is designed as a pre-trained encoder–decoder architecture that describes the 3D structure of the scene. For surface normal estimation, the network is designed as a two-stream encoder–decoder structure that hierarchically merges red-green-blue-depth (RGB-D) images to capture more accurate geometric boundaries. With fewer network parameters and a simpler learning structure, the method produces more detailed depth maps than existing state-of-the-art approaches. Moreover, 3D point cloud maps reconstructed from the predicted depth images confirm that our framework can be conveniently adopted as a component of a monocular simultaneous localization and mapping (SLAM) pipeline.
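The two-stage pipeline described above can be sketched as follows. This is a hypothetical illustration of the data flow only: the two networks are stubbed with trivial placeholders, and the function names, image size, and constant depth value are illustrative assumptions, not the paper's actual models.

```python
# Hypothetical sketch of the two-stage pipeline: coarse depth prediction,
# then surface-normal-guided refinement. Placeholders stand in for the
# real CNNs; nothing here reproduces the paper's architecture.

def coarse_depth_network(rgb):
    """Stand-in for the lightweight encoder-decoder coarse depth network."""
    h, w = len(rgb), len(rgb[0])
    return [[2.0] * w for _ in range(h)]  # placeholder: constant coarse depth map

def normal_guided_refinement(rgb, coarse_depth):
    """Stand-in for the two-stream RGB-D network with surface normal guidance."""
    # The real network hierarchically fuses RGB and coarse-depth features;
    # here the coarse map is simply passed through unchanged.
    return coarse_depth

rgb_frame = [[(0, 0, 0)] * 8 for _ in range(6)]        # tiny 6x8 "image"
coarse = coarse_depth_network(rgb_frame)               # stage 1: coarse depth
refined = normal_guided_refinement(rgb_frame, coarse)  # stage 2: refinement
```

The key design point is that refinement consumes both the original RGB frame and the stage-1 depth map, which is what lets the second network sharpen geometric boundaries that the coarse prediction smooths over.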

Highlights


  • Coarse depth images generated from the coarse depth estimation (CDE) network are fed to the RGB-D surface normal estimation network

  • The encoder layers were designed based on a pre-trained deep learning model originally developed for image classification



Introduction

Image-based depth prediction has been extensively studied and widely applied to 3D scene understanding tasks such as structure from motion (SFM) [1,2], simultaneous localization and mapping (SLAM) [3,4], and 3D object detection [5]. Image-based depth estimation defines the depth of each pixel as the distance from the corresponding object point to the camera, and exploits image cues such as linear perspective, focus, occlusion, texture, shadow, and gradient. Image-based methods can be grouped into two classes: stereo vision methods and monocular methods. Stereo vision methods depend heavily on natural light to collect images and are sensitive to changes in illumination angle and intensity. Mismatches between the two images can introduce considerable errors into the matching algorithm.
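To make the stereo-vision definition of depth concrete, the classic rectified-pinhole relation recovers depth Z from disparity d as Z = f·B/d, where f is the focal length in pixels and B is the camera baseline. The sketch below uses illustrative numbers (700 px focal length, 0.12 m baseline, 21 px disparity), not values from the paper; it only shows why poor matching, i.e., a wrong disparity, directly corrupts the estimated depth.

```python
# Classic pinhole-stereo depth from disparity: Z = f * B / d.
# All numeric values are illustrative, not taken from the paper.

def depth_from_disparity(f_px, baseline_m, disparity_px):
    """Depth (metres) of a point given its disparity between two rectified views."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return f_px * baseline_m / disparity_px

z = depth_from_disparity(f_px=700.0, baseline_m=0.12, disparity_px=21.0)
# 700 * 0.12 / 21 = 4.0 metres
```

Note that depth is inversely proportional to disparity, so a matching error of even one pixel at small disparities (distant objects) produces a large depth error, which is exactly the sensitivity to illumination and matching quality discussed above.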

