Abstract

Abstract. Supervised learning-based methods for monocular depth estimation usually require large amounts of extensively annotated training data. In the case of aerial imagery, this ground truth is particularly difficult to acquire. Therefore, in this paper, we present a method for self-supervised learning for monocular depth estimation from aerial imagery that does not require annotated training data. For this, we use only an image sequence from a single moving camera and learn to simultaneously estimate depth and pose information. By sharing the weights between pose and depth estimation, we achieve a relatively small model, which favors real-time application. We evaluate our approach on three diverse datasets and compare the results to conventional methods that estimate depth maps based on multi-view geometry. We achieve an accuracy δ1.25 of up to 93.5 %. In addition, we have paid particular attention to the generalization of a trained model to unknown data and the self-improving capabilities of our approach. We conclude that, even though the results of monocular depth estimation are inferior to those achieved by conventional methods, they are well suited to provide a good initialization for methods that rely on image matching or to provide estimates in regions where image matching fails, e.g. in occluded or texture-less regions.
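The shared-weight design mentioned in the abstract, in which a single encoder feeds both a depth decoder and a pose head, can be sketched as follows. This is a minimal illustrative sketch in PyTorch, not the authors' implementation: the module names (SharedEncoder, DepthDecoder, PoseHead), layer choices, and sizes are all assumptions. During training, the predicted depth and relative pose would typically be used to warp a source frame into the target view and minimize a photometric reconstruction loss, which is the usual self-supervision signal in this family of methods.

```python
import torch
import torch.nn as nn

# Minimal sketch of joint depth/pose estimation with a shared encoder.
# All module names and layer sizes are illustrative assumptions,
# not the architecture used in the paper.

class SharedEncoder(nn.Module):
    """Feature extractor whose weights are shared by both tasks."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 7, stride=2, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.net(x)

class DepthDecoder(nn.Module):
    """Upsamples shared features back to a dense, per-pixel depth map."""
    def __init__(self):
        super().__init__()
        self.up = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, feat):
        # Sigmoid output would be rescaled to a metric depth range elsewhere.
        return self.up(feat)

class PoseHead(nn.Module):
    """Regresses a 6-DoF relative pose (3 rotations, 3 translations)."""
    def __init__(self):
        super().__init__()
        self.squeeze = nn.Conv2d(2 * 128, 128, 1)
        self.fc = nn.Conv2d(128, 6, 1)

    def forward(self, feat_t, feat_s):
        x = self.squeeze(torch.cat([feat_t, feat_s], dim=1)).relu()
        # Global average over spatial dims -> one 6-DoF vector per frame pair.
        return self.fc(x).mean(dim=(2, 3))

encoder, depth_dec, pose_head = SharedEncoder(), DepthDecoder(), PoseHead()
target = torch.randn(1, 3, 192, 256)   # frame I_t
source = torch.randn(1, 3, 192, 256)   # frame I_{t-1}

f_t, f_s = encoder(target), encoder(source)  # same weights for both tasks
depth = depth_dec(f_t)                       # dense depth for I_t
pose = pose_head(f_t, f_s)                   # relative camera motion
```

Reusing one encoder for both tasks avoids duplicating the feature-extraction parameters of two separate networks, which is consistent with the abstract's point that weight sharing yields a relatively small model suited to real-time application.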

Highlights

  • Dense depth estimation is one of the most important and intensively studied tasks in photogrammetric computer vision

  • We compare the results achieved by our Self-supervised Monocular Depth Estimation (SMDE) approach to results obtained by conventional methods that estimate depth maps based on multi-view geometry, and we argue why it can be feasible to rely on SMDE instead

  • We quantitatively assess the results of the Self-supervised Monocular Depth Estimation approach with respect to the ground truth based on three metrics, namely the Root Mean Square Error (RMSE), the relative L1 norm (L1-rel) as described in (Ruf et al., 2019) and the accuracy δ1.25 (see the sketch after this list)
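The three metrics named in the last highlight can be computed directly from a predicted and a ground-truth depth map. The sketch below is a hedged illustration, assuming both maps are given as flat arrays of matching valid depth values; the threshold of 1.25 corresponds to the δ1.25 accuracy reported in the abstract, and the L1-rel formula shown is a common definition whose details may differ from the exact formulation in Ruf et al. (2019).

```python
import numpy as np

def depth_metrics(pred, gt, delta_threshold=1.25):
    """Sketch of common depth-map evaluation metrics.

    `pred` and `gt` are 1-D arrays of matching valid depth values.
    """
    rmse = np.sqrt(np.mean((pred - gt) ** 2))    # Root Mean Square Error
    l1_rel = np.mean(np.abs(pred - gt) / gt)     # relative L1 error (L1-rel)
    ratio = np.maximum(pred / gt, gt / pred)     # per-pixel depth ratio
    accuracy = np.mean(ratio < delta_threshold)  # fraction within delta
    return rmse, l1_rel, accuracy

# Usage with random stand-in data (not actual results from the paper):
gt = np.random.uniform(10.0, 100.0, size=10000)
pred = gt * np.random.uniform(0.9, 1.1, size=gt.shape)
print(depth_metrics(pred, gt))
```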

Introduction

Dense depth estimation is one of the most important and intensively studied tasks in photogrammetric computer vision. While depth is conventionally recovered from multiple overlapping images by means of multi-view geometry, it is also possible to infer depth from a single image alone. This process, known as Monocular Depth Estimation, is motivated by the ability of humans to estimate depth from a single image of a known scene. Similar to the empirical knowledge that humans accumulate throughout their lifetime, state-of-the-art Convolutional Neural Networks (CNNs) are able to efficiently learn discriminative image cues that allow them to infer depth information from a new, so far unseen, image. However, this only holds if the scene depicted in the new image is the same as, or at least similar to, the scene covered by the training data.
