Abstract

Monocular depth estimation based on unsupervised learning has attracted great attention due to the rising demand for lightweight monocular vision sensors. Inspired by multi-task learning, semantic information has been used to improve monocular depth estimation models. However, multi-task learning is still limited by its need for multiple types of annotation: as far as we know, scarcely any large public datasets provide all the necessary information. Therefore, we propose a novel network architecture, the Semantic-Feature-Aided Monocular Depth Estimation Network (SFA-MDEN), which extracts multi-resolution depth features and semantic features, merges them, and feeds them into the decoder, with the goal of predicting depth with the support of semantics. Instead of using loss functions to relate semantics and depth, SFA-MDEN fuses the semantic and depth feature maps to predict monocular depth. Consequently, two accessible datasets with similar topics, one for depth estimation and one for semantic segmentation, are sufficient as training sets for SFA-MDEN. We evaluated the proposed SFA-MDEN with experiments on different datasets, including KITTI, Make3D, and our own dataset BHDE-v1. The experimental results demonstrate that SFA-MDEN achieves competitive accuracy and generalization capacity compared to state-of-the-art methods.

Highlights

  • Depth estimation plays a fundamental role in numerous application scenarios, such as image reconstruction [1], object detection [2], semantic segmentation [3], pose estimation [4], and medical image processing [5]

  • Monocular depth estimation refers to the scene depth recovery from a single two-dimensional image captured by the camera [8], which benefits the development of lightweight sensors

  • A novel two-branch network architecture is proposed that couples semantic segmentation features into a monocular depth estimation network, with the aim of improving the robustness and precision of depth estimation models

Summary

Introduction

Depth estimation plays a fundamental role in numerous application scenarios, such as image reconstruction [1], object detection [2], semantic segmentation [3], pose estimation [4], and medical image processing [5]. A novel two-branch network architecture is proposed that couples semantic segmentation features into a monocular depth estimation network, with the aim of improving the robustness and precision of depth estimation models. Traditional multi-task learning, which predicts depth and semantic segmentation simultaneously, employs a multi-term loss; this requires both semantic annotations and depth self-labels (derived from stereo images or image sequences) from the training datasets. The remainder of the paper is organized as follows: Section 2 reviews existing monocular depth estimation models based on deep learning, especially the methods involving semantics; Section 3 presents the structure and the TSTB training strategy of SFA-MDEN; Section 4 presents experimental results and analysis; and Section 5 presents the concluding remarks.
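
The two-branch idea, in which semantic features are merged with depth features at several resolutions before decoding, can be illustrated with a minimal sketch. This is not the SFA-MDEN implementation: the fusion operator used here (channel-wise concatenation), the scale labels, and all function names are assumptions made purely for illustration, and feature maps are modelled as plain nested lists rather than tensors.

```python
from typing import Dict, List, Tuple

# A feature map is a nested list indexed as [channel][row][column].
FeatureMap = List[List[List[float]]]


def make_feature_map(channels: int, height: int, width: int, value: float) -> FeatureMap:
    """Build a constant-valued feature map (a stand-in for real encoder output)."""
    return [[[value] * width for _ in range(height)] for _ in range(channels)]


def shape(feature_map: FeatureMap) -> Tuple[int, int, int]:
    """Return (channels, height, width) of a feature map."""
    return len(feature_map), len(feature_map[0]), len(feature_map[0][0])


def fuse_by_concat(depth_feat: FeatureMap, semantic_feat: FeatureMap) -> FeatureMap:
    """Fuse two feature maps by stacking their channels.

    Concatenation is an assumed fusion operator; the source only states
    that the two kinds of features are merged.
    """
    _, depth_h, depth_w = shape(depth_feat)
    _, sem_h, sem_w = shape(semantic_feat)
    if (depth_h, depth_w) != (sem_h, sem_w):
        raise ValueError("feature maps must share the same spatial size")
    # List concatenation along the outermost (channel) axis.
    return depth_feat + semantic_feat


def fuse_pyramids(
    depth_pyramid: Dict[str, FeatureMap],
    semantic_pyramid: Dict[str, FeatureMap],
) -> Dict[str, FeatureMap]:
    """Fuse depth and semantic features scale by scale."""
    if depth_pyramid.keys() != semantic_pyramid.keys():
        raise ValueError("both branches must provide the same set of scales")
    return {
        scale: fuse_by_concat(depth_pyramid[scale], semantic_pyramid[scale])
        for scale in depth_pyramid
    }


if __name__ == "__main__":
    # Hypothetical two-level pyramids from the depth and semantic branches.
    depth = {"1/4": make_feature_map(2, 4, 4, 0.5), "1/8": make_feature_map(3, 2, 2, 0.1)}
    semantic = {"1/4": make_feature_map(1, 4, 4, 1.0), "1/8": make_feature_map(2, 2, 2, 2.0)}
    for scale, fused in fuse_pyramids(depth, semantic).items():
        print(scale, shape(fused))
```

Because the two branches interact only through their feature maps, each branch could in principle be trained on its own dataset, which matches the stated motivation for fusing features rather than coupling the tasks through a joint loss.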

Monocular Depth Estimation
Monocular Depth Estimation with Semantics
Framework and Training Strategy
Network Architectures
Loss Function
Implementation Details and Metrics
Eigen Split of KITTI
Method
Findings
Self-Made Datasets