Abstract

Depth estimation from a single image is a crucial yet challenging task for reconstructing 3D structure and inferring scene geometry. However, most existing methods fail to extract sufficiently detailed information and to estimate distant, small-scale objects well. In this paper, we propose a monocular depth estimation method based on multi-scale feature fusion. Specifically, to obtain input features at different scales, we first feed input images of different scales into pre-trained residual networks with shared weights. Then, an attention mechanism is used to learn the salient features at each scale, integrating the detailed information in large-scale feature maps with the scene-level information in small-scale feature maps. Furthermore, inspired by dense atrous spatial pyramid pooling in semantic segmentation, we build a multi-scale feature fusion dense pyramid to further improve feature extraction. Finally, a scale-invariant error loss is used to predict depth maps in log space. We evaluate our method on several public benchmark datasets, including NYU Depth V2 and KITTI. The experimental results show that the proposed method outperforms existing methods and achieves state-of-the-art results.
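The abstract does not spell out the exact form of the scale-invariant error, but losses of this kind are commonly formulated as the scale-invariant log error of Eigen et al. (2014). The sketch below is a minimal NumPy version under that assumption; the balancing weight lam and the smoothing constant eps are illustrative parameters, not values taken from the paper.

    import numpy as np

    def scale_invariant_log_loss(pred, target, lam=0.5, eps=1e-8):
        # pred, target: arrays of positive depth values with the same shape.
        # Per-pixel difference in log space; eps guards against log(0).
        d = np.log(pred + eps) - np.log(target + eps)
        n = d.size
        # Mean squared log error, minus a term that discounts a global
        # scale offset between prediction and ground truth.
        return np.mean(d ** 2) - lam * (np.sum(d) ** 2) / (n ** 2)

With lam = 1 the loss is fully invariant to a global scaling of the predicted depths, while lam = 0 reduces it to a plain mean squared error in log space.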
