Monocular depth estimation, as one of the fundamental tasks of computer vision, plays a crucial role in three-dimensional (3D) scene understanding and perception. Deep learning methods typically recover monocular depth maps in a continuous regression manner, minimizing the error between the ground-truth and predicted depth. However, fine depth features may not be fully captured through layer-by-layer encoding, which tends to produce depth maps with low spatial resolution and insufficient detail; moreover, such regression usually converges slowly and yields unsatisfactory results. To tackle these issues, we propose in this paper a novel model, the context-based ordinal regression network (CORNet), which reconstructs monocular depth maps in an ordinal regression manner with context information. First, we put forward a context-based encoder with a feature transformation (FT) module that learns context information and fine details from the input and outputs multi-scale feature maps. Then, we design a boundary enhancement module (BEM) with a spatial attention mechanism following each feature-fusion operation, which captures boundary features in the scene to enhance depth at object borders. Next, a feature optimization module (FOM) fuses and optimizes the multi-scale and boundary features to strengthen depth learning. Finally, we introduce an ordinal weighted inference that predicts depth maps from the predicted probabilities and discretization values. Experiments on two challenging datasets, KITTI and NYU Depth V2, demonstrate that the proposed CORNet estimates monocular depth maps effectively and outperforms existing methods in capturing geometric features.
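To make the final step concrete, below is a minimal sketch of what an ordinal weighted inference of this kind typically computes: depth is discretized into K ordinal bins, the network outputs per-pixel bin probabilities, and the predicted depth is the probability-weighted sum of the bin values. The log-uniform (spacing-increasing) bin spacing, the bin count, and all names here are illustrative assumptions, not CORNet's exact design.

```python
import torch

def discretize_depth(d_min: float, d_max: float, num_bins: int) -> torch.Tensor:
    """Log-uniform (spacing-increasing) discretization of [d_min, d_max].

    Returns a (num_bins,) tensor of depth values, one per ordinal bin.
    """
    t = torch.linspace(0.0, 1.0, num_bins)
    return d_min * (d_max / d_min) ** t

def ordinal_weighted_inference(logits: torch.Tensor, bin_values: torch.Tensor) -> torch.Tensor:
    """Predict continuous depth as a probability-weighted sum over bin values.

    logits:     (B, K, H, W) per-pixel scores over K ordinal depth bins
    bin_values: (K,) depth value assigned to each bin
    returns:    (B, H, W) continuous depth map
    """
    probs = torch.softmax(logits, dim=1)                   # per-pixel bin probabilities
    depth = (probs * bin_values.view(1, -1, 1, 1)).sum(1)  # expectation over bins
    return depth

# Usage sketch: 80 bins over a KITTI-like range of [1 m, 80 m];
# the random logits stand in for a real network's output.
bins = discretize_depth(1.0, 80.0, 80)
logits = torch.randn(2, 80, 48, 160)
depth_map = ordinal_weighted_inference(logits, bins)  # shape (2, 48, 160)
```

Compared with taking the arg-max bin, this soft weighting yields continuous depth values and avoids quantization artifacts at bin boundaries.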