Abstract

Most existing monocular depth estimation methods based on supervised learning require vast quantities of ground truth depth data for training. To take full advantage of the information in images captured by drones without this requirement, we propose an unsupervised monocular depth estimation model for drones based on a residual neural network with coarse–refined feature extraction. Inspired by the principle of binocular depth estimation, the model introduces a virtual camera through a deep residual convolutional neural network with coarse–refined feature extraction, which turns unsupervised monocular depth estimation into an image reconstruction problem. To improve the performance of the model, we make the following contributions. First, pyramid processing of the input image is proposed to build the topological relationship between the resolution and the depth of the input image, which improves sensitivity to depth information in a single image and reduces the impact of input resolution on depth estimation. Second, a residual neural network with coarse–refined feature extraction is designed to reconstruct the corresponding image; it improves the accuracy of feature extraction and resolves the conflict between computation time and the number of network layers. In addition, to predict highly detailed depth maps, long skip connections are added between corresponding layers of the coarse feature extraction network and the deconvolution network for refined feature extraction. Third, a reconstruction loss for the corresponding image based on the structural similarity index (SSIM), an approximate disparity smoothness loss and a depth map loss are combined into a novel training loss to better train the model.
Experimental results show that our model outperforms state-of-the-art monocular depth estimation methods on the KITTI dataset, which consists of corresponding left and right views, and on the Make3D dataset, which consists of images and corresponding ground truth depth maps. When trained on KITTI, the model also basically meets the depth information requirements for images captured by drones.
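
The pyramid processing and the combined training loss described above can be illustrated with a minimal NumPy sketch. This is an illustration under stated assumptions, not the authors' implementation: the 2x2 average-pooling pyramid, the 3x3 SSIM window, the SSIM/L1 mixing factor `alpha`, the edge-aware weighting of the smoothness term and the loss weights are all assumptions borrowed from common practice in left-right reconstruction models. The abstract does not define the depth map loss, so it appears only as a caller-supplied placeholder value (`depth_term`).

```python
import numpy as np


def build_pyramid(image, levels=4):
    """Return [full, 1/2, 1/4, ...] resolution copies of a 2D image.

    Assumption: each level is produced by 2x2 average pooling.
    """
    pyramid = [image]
    for _ in range(levels - 1):
        prev = pyramid[-1]
        h, w = prev.shape[0] // 2 * 2, prev.shape[1] // 2 * 2  # crop to even size
        pooled = prev[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
        pyramid.append(pooled)
    return pyramid


def box3(x):
    """3x3 mean filter over valid positions (requires NumPy >= 1.20)."""
    windows = np.lib.stride_tricks.sliding_window_view(x, (3, 3))
    return windows.mean(axis=(-2, -1))


def ssim_map(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Per-pixel SSIM for images scaled to [0, 1], using 3x3 local statistics."""
    mu_x, mu_y = box3(x), box3(y)
    var_x = box3(x * x) - mu_x ** 2
    var_y = box3(y * y) - mu_y ** 2
    cov = box3(x * y) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den


def reconstruction_loss(target, reconstructed, alpha=0.85):
    """SSIM-based reconstruction loss mixed with an L1 term (alpha is assumed)."""
    ssim_term = np.clip((1.0 - ssim_map(target, reconstructed)) / 2.0, 0.0, 1.0).mean()
    l1_term = np.abs(target - reconstructed).mean()
    return float(alpha * ssim_term + (1.0 - alpha) * l1_term)


def smoothness_loss(disparity, image):
    """Disparity gradient penalty, down-weighted at image edges (assumed form)."""
    dx = np.abs(np.diff(disparity, axis=1)) * np.exp(-np.abs(np.diff(image, axis=1)))
    dy = np.abs(np.diff(disparity, axis=0)) * np.exp(-np.abs(np.diff(image, axis=0)))
    return float(dx.mean() + dy.mean())


def total_loss(recon_term, smooth_term, depth_term, weights=(1.0, 0.1, 1.0)):
    """Weighted sum of the three loss terms.

    depth_term is a hypothetical placeholder: the abstract names a depth map
    loss but does not define it, so the caller must supply its value.
    """
    w_r, w_s, w_d = weights
    return w_r * recon_term + w_s * smooth_term + w_d * depth_term
```

In a training loop, the reconstruction and smoothness terms would typically be evaluated at every pyramid level and summed before being passed to `total_loss`; whether the paper does so is not stated in the abstract.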

Highlights

  • In recent decades, drones have been widely used in various fields, such as aerial photography, rescue and plant protection, owing to their low cost, high flexibility and reliability [1]

  • In Section 3.3.2, several factors affecting the performance of monocular depth estimation were analyzed: (a) We compared our model with pyramid-processed input images against a variant without pyramid processing, demonstrating that pyramid processing improves sensitivity to depth information in monocular images and reduces the influence of input image size on the estimation result. (b) To test the advantages of residual neural networks, we compared the performance of our residual network with that of VGG-16 in our model. (c) We compared our model with a variant without long skip connections between corresponding layers of the coarse feature extraction network and the deconvolution network for refined feature extraction, to show that long skip connections help predict highly detailed depth maps. (d) To evaluate the proposed loss function, we compared our novel loss with other well-known training losses

Summary

Introduction

Drones are widely used in various fields, such as aerial photography, rescue and plant protection, owing to their low cost, high flexibility and reliability [1]. Although shape from motion (SFM) [5], shape from shading (SFS) [6] and depth from focus or defocus (DFF/DFD) [7,8] are considered classical algorithms for monocular depth estimation, these methods are not widely used because of device cost, strict requirements on image capture, and results that are susceptible to occlusion, correspondence matching problems and so on. Methods based on deep neural networks for obtaining depth information from 2D images fall into two major categories, according to whether vast quantities of ground truth depth data are required for training: monocular depth estimation based on supervised learning and monocular depth estimation based on unsupervised learning.
