Depth and surface normal estimation are important for 3D geometric perception, which has numerous applications including autonomous vehicles and robots. In this paper, we propose a lightweight Multi-stage Information Diffusion Network (MIDNet) for the simultaneous prediction of depth and surface normals from a single RGB image. To obtain semantic and detail-preserving features, we adopt a high-resolution network as our backbone to learn multi-scale features, which are then fused into shared features for the two tasks. To let the two tasks mutually boost each other, we propose a Cross-Correlation Attention Module (CCAM) that adaptively integrates information between the two tasks at multiple stages, covering both feature-level and task-level interaction. Ablation studies show that the proposed multi-stage information diffusion strategy improves performance on both tasks at each level. Compared to current state-of-the-art methods on the NYU Depth V2, Stanford 2D-3D-Semantic and KITTI datasets, our method achieves superior performance for both monocular depth and surface normal estimation.
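To make the cross-task interaction concrete, below is a minimal PyTorch sketch of a cross-correlation attention block between a depth branch and a normal branch. The abstract does not specify CCAM's internals, so the class name `CrossCorrelationAttention`, the 1x1 projections, and the channel-gating scheme are all assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class CrossCorrelationAttention(nn.Module):
    """Hypothetical sketch of a cross-correlation attention block that
    lets the depth and normal branches exchange information.
    The projections and gating scheme are assumptions, not the
    paper's actual CCAM architecture."""

    def __init__(self, channels: int):
        super().__init__()
        # 1x1 projections for each task branch (assumed design choice)
        self.proj_depth = nn.Conv2d(channels, channels, kernel_size=1)
        self.proj_normal = nn.Conv2d(channels, channels, kernel_size=1)
        # Gates that map the cross-task correlation to attention weights
        self.gate_depth = nn.Conv2d(channels, channels, kernel_size=1)
        self.gate_normal = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, f_depth: torch.Tensor, f_normal: torch.Tensor):
        d = self.proj_depth(f_depth)
        n = self.proj_normal(f_normal)
        # Channel-wise correlation between the two task features,
        # averaged over space to a per-channel statistic (B, C, 1, 1).
        corr = (d * n).mean(dim=(2, 3), keepdim=True)
        # Each branch is re-weighted by attention derived from the
        # shared correlation, so the two tasks can inform each other.
        f_depth = f_depth + f_depth * torch.sigmoid(self.gate_depth(corr))
        f_normal = f_normal + f_normal * torch.sigmoid(self.gate_normal(corr))
        return f_depth, f_normal


if __name__ == "__main__":
    block = CrossCorrelationAttention(channels=64)
    fd = torch.randn(2, 64, 60, 80)  # depth-branch features
    fn = torch.randn(2, 64, 60, 80)  # normal-branch features
    out_d, out_n = block(fd, fn)
    print(out_d.shape, out_n.shape)  # both torch.Size([2, 64, 60, 80])
```

Stacking such a block at several decoder stages would give the multi-stage information diffusion the abstract describes, with earlier stages exchanging feature-level cues and later stages exchanging task-level predictions.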