Abstract

Monocular depth estimation from a single still image is a fundamental and challenging problem in scene understanding. As a pixel-level regression problem, the inherently continuous and wide range of depth values causes three main difficulties: (1) the imbalance between large and small values during regression; (2) exploiting multi-scale contextual information; and (3) preserving spatial and semantic structure. To overcome these difficulties, this paper presents a novel spatially coherent sliced network for monocular depth estimation (SCS-Net). It first uses a feature pyramid network to fuse hierarchical feature maps. The depth range is then sliced so that estimation in different ranges, generated from the multi-scale feature fusions, can be supervised separately. The holistic depth serves as the supervisory signal for the aggregated feature maps, ensuring that the global structure of the scene is learned. A self-spatial-attention mechanism further exploits object semantics and spatial pixel relations to maintain spatial and semantic coherence in the sliced depth estimation. Finally, regularization with second-order depth information prevents the estimated boundaries from becoming overly smooth. Extensive ablation experiments and comparisons on three popular indoor and outdoor benchmark datasets demonstrate the effectiveness and robustness of the proposed approach.

• Spatially coherent sliced network regresses sliced depth from pyramidal feature fusions.
• Our network introduces scale and shift regression to maintain spatial coherence.
• Self-spatial-attention mechanism refines potentially compromised semantic coherence.
• Experimental results show competitive and promising performance in urban road scenes.
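The combination of sliced and holistic supervision described above can be illustrated with a minimal sketch: the depth range is divided into equal bins, a masked loss is computed per slice, and a holistic term supervises the global structure. Everything below (the equal-width bin edges, the L1 loss form, and the unit weighting) is a hypothetical simplification, not the paper's exact formulation.

```python
import numpy as np

def sliced_depth_loss(pred, gt, num_slices=4, d_min=0.0, d_max=10.0):
    """Hypothetical sliced-depth supervision: per-slice masked L1 losses
    over equal-width depth bins, plus a holistic L1 term that supervises
    the global structure of the scene."""
    edges = np.linspace(d_min, d_max, num_slices + 1)
    slice_losses = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Supervise only the pixels whose ground-truth depth falls in this slice,
        # so small- and large-depth values do not dominate one another.
        mask = (gt >= lo) & (gt < hi)
        if mask.any():
            slice_losses.append(np.abs(pred[mask] - gt[mask]).mean())
    holistic = np.abs(pred - gt).mean()  # global-structure supervision
    return holistic + float(np.mean(slice_losses))
```

Averaging only over the occupied slices keeps the loss well-defined when a depth range is absent from an image, e.g. close-range-only indoor frames.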
