Abstract
Geometric and semantic contexts are essential to solving the ill-posed problem of monocular depth estimation (MDE). In this paper, we propose a deep MDE framework that aggregates dual-modal structural contexts for monocular depth estimation (DSC-MDE). First, a cross-shaped context (CSC) aggregation module is developed to globally encode the geometric structures of depth maps observed within the field of view of robots and autonomous vehicles. Next, the CSC-encoded geometric features are further modulated with semantic context in an object-regional context (ORC) aggregation module. Finally, to train the proposed network, we present a focal ordinal loss (FOL), which pays greater attention to distant samples and thereby avoids the over-relaxed constraints that the ordinal regression loss (ORL) imposes on such samples. We compare the proposed model with recent methods that exploit geometric and multi-modal contexts, and show that it achieves state-of-the-art performance on both indoor and outdoor datasets, including NYU-Depth-V2, Cityscapes, and KITTI.
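To make the FOL idea concrete, below is a minimal PyTorch sketch of a focal-weighted ordinal loss. It assumes the standard ordinal-regression setup for depth (as in DORN-style methods): depth is discretized into K bins and the network predicts, per pixel and per threshold k, the probability that the true depth exceeds bin k. The focal modulation term `(1 - p_t)^gamma` is our assumption about how "more attention to distant samples" is realized; the function name, `gamma`, and the bin encoding are illustrative, not the authors' exact formulation.

```python
import torch

def focal_ordinal_loss(probs: torch.Tensor,
                       target_bins: torch.Tensor,
                       gamma: float = 2.0,
                       eps: float = 1e-6) -> torch.Tensor:
    """Hypothetical focal ordinal loss sketch.

    probs:       (N, K, H, W) sigmoid outputs; probs[:, k] ~ P(depth > bin_k).
    target_bins: (N, H, W) integer ground-truth depth-bin index per pixel.
    """
    n, k, h, w = probs.shape

    # Ordinal encoding: the label for threshold k is 1 iff the true bin
    # lies above that threshold.
    thresholds = torch.arange(k, device=probs.device).view(1, k, 1, 1)
    labels = (target_bins.unsqueeze(1) > thresholds).float()  # (N, K, H, W)

    # p_t: probability the model assigns to the correct side of each threshold.
    p_t = labels * probs + (1.0 - labels) * (1.0 - probs)

    # Focal modulation down-weights thresholds that are already well
    # classified, so hard (typically distant) samples, whose constraints
    # the plain ORL leaves over-relaxed, contribute more to the gradient.
    loss = -((1.0 - p_t) ** gamma) * torch.log(p_t.clamp(min=eps))
    return loss.mean()
```

With `gamma = 0` this reduces to the ordinary per-threshold binary cross-entropy of ORL, which makes the focal term's role as a reweighting of hard samples explicit.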