Abstract

Monocular depth estimation is an ill-posed problem because infinitely many 3D scenes can project to the same 2D image. Most recent methods rely on image-level information from deep convolutional neural networks, but training such networks may suffer from slow convergence and accuracy degradation, especially with deeper networks and more feature channels. Based on an encoder-decoder framework, we propose a novel Residual DenseASPP Network. In our Residual DenseASPP network, we categorize features as low/mid/high vision features and use two kinds of skip connections to learn useful features at specific layers: feature concatenation in the dense block generates more features within the same layer, while feature summation in the residual block improves backward gradient flow. The experimental results show that high vision features benefit from more channels via feature concatenation, while low/mid vision features need the better convergence provided by feature summation. Experiments show that our proposed approach achieves state-of-the-art performance on both the NYUv2 and Make3D datasets.
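
To make the two skip-connection types concrete, the following minimal PyTorch sketch (our own illustration, not the authors' implementation; module names, channel counts, and layer choices are assumptions) contrasts feature concatenation in a dense layer with feature summation in a residual block:

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """Dense-block layer (illustrative): the layer's output is
    concatenated to its input, so the channel count grows and later
    layers see the features of all earlier layers."""
    def __init__(self, in_ch, growth):
        super().__init__()
        self.conv = nn.Sequential(
            nn.BatchNorm2d(in_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, growth, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, x):
        # Feature concatenation: generates more features in the same layer.
        return torch.cat([x, self.conv(x)], dim=1)

class ResidualBlock(nn.Module):
    """Residual block (illustrative): the block's output is summed
    with its input, giving gradients a short identity path backward."""
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(ch, ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, x):
        # Feature summation: channel count unchanged; the identity
        # shortcut eases backward gradient flow.
        return x + self.conv(x)
```

Concatenation widens the representation (useful where more channels help), whereas summation keeps the width fixed and mainly improves optimization, matching the paper's finding about high versus low/mid vision features.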

Highlights

  • Monocular depth estimation aims to estimate the depth information of a scene from an RGB image, a fundamental problem in computer vision with many potential applications, such as semantic segmentation [1], [2], object detection [3]–[5], human pose estimation [6], [7], 3D reconstruction [8], and simultaneous localization and mapping [9]

  • To solve the above problems, we propose a novel Residual DenseASPP Network based on an encoder-decoder framework

  • We investigate which kinds of low/mid/high vision features are important for depth estimation

Summary

INTRODUCTION

Monocular depth estimation aims to estimate the depth information of a scene from an RGB image; it is a fundamental problem in computer vision with many potential applications, such as semantic segmentation [1], [2], object detection [3]–[5], human pose estimation [6], [7], 3D reconstruction [8], and simultaneous localization and mapping [9]. Inspired by the remarkable success of image classification, most recent methods approach this task with deep networks. They learn visual representations for depth estimation in an end-to-end multi-layer fashion [10], in which features with various receptive fields are generated by convolution and pooling operations. However, training these networks may suffer from slow convergence and accuracy degradation; this is aggravated in a DenseNet with multi-scale ASPP, which can cause the model to overfit. This suggests that merely searching for a better architecture to reduce model complexity is not sufficient for depth estimation. The main contributions of our work are threefold: (1) we propose a novel Residual DenseASPP Network, in which we fully exploit the network architecture for low/mid/high vision features by fusing two types of skip connections. The visualization results show that our method gives good predictions in many challenging cases, including small objects, complex boundaries, varying illumination, and objects with large depth variation.
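
As background for the atrous-convolution building block discussed in the sections below, here is a minimal PyTorch sketch of an ASPP module (an illustration only; the dilation rates and channel arrangement are assumptions, not the paper's configuration). Parallel dilated convolutions enlarge the receptive field at multiple scales without the resolution loss of pooling:

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Atrous Spatial Pyramid Pooling (illustrative sketch).

    Parallel 3x3 convolutions with different dilation rates see
    different receptive-field sizes on the same feature map; their
    outputs are concatenated and fused by a 1x1 convolution.
    """
    def __init__(self, in_ch, out_ch, rates=(3, 6, 12, 18)):
        super().__init__()
        # One dilated branch per rate; padding=dilation keeps spatial size.
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, kernel_size=3,
                      padding=r, dilation=r, bias=False)
            for r in rates
        )
        self.project = nn.Conv2d(len(rates) * out_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))
```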

RELATED WORK
FROM ATROUS CONVOLUTION TO RESIDUAL DenseASPP
RESIDUAL DenseASPP
EXPERIMENTS
DATASETS
BASELINES
ERROR METRICS
COMPARISON WITH THE STATE-OF-THE-ART
CONCLUSIONS