Abstract

Vision-based depth estimation plays a significant role in Intelligent Transportation Systems (ITS) because of its low cost and high efficiency; it can be used to analyze the driving environment, improve driving safety, and more. Although recently proposed approaches abandon time-consuming pre-processing and post-processing steps and predict in an end-to-end manner, fine details may be lost through max-pooling-based encoding modules. To tackle this problem, we propose the Multi-Scale Dilated Convolution Network (MSDC-Net), a dilated-convolution-based deep network. In the feature encoding and decoding parts, dilated layers maintain the scale of the original image and reduce the loss of detail. After that, a pyramid dilated feature extraction module is added to integrate the knowledge learned in the forward pass under different receptive fields. The proposed approach is evaluated on the KITTI dataset and achieves state-of-the-art results.
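To make the pyramid dilated feature extraction concrete, below is a minimal sketch of one plausible realization, assuming a PyTorch-style module with parallel dilated branches. The dilation rates (1, 2, 4, 8) and channel counts are illustrative assumptions, not MSDC-Net's published configuration.

```python
import torch
import torch.nn as nn

class PyramidDilatedBlock(nn.Module):
    """Illustrative sketch of a pyramid dilated feature extractor.
    The rates and channel counts are assumptions for illustration,
    not MSDC-Net's published configuration."""

    def __init__(self, in_ch=256, branch_ch=64, rates=(1, 2, 4, 8)):
        super().__init__()
        # one 3x3 dilated branch per rate; padding=r keeps H x W unchanged
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, branch_ch, kernel_size=3, padding=r, dilation=r)
            for r in rates
        ])
        # 1x1 convolution fuses the concatenated multi-scale features
        self.fuse = nn.Conv2d(branch_ch * len(rates), in_ch, kernel_size=1)

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]  # same spatial size
        return self.fuse(torch.cat(feats, dim=1))        # integrate contexts
```

Because every branch shares the same spatial size, a single block can aggregate both local and long-range context before the 1x1 fusion, which is the integration role the abstract describes.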

Highlights

  • Compared with Light Detection and Ranging (LiDAR), structured light [5], and time-of-flight [6] sensors, vision-based depth estimation can construct a fully dense depth map at lower expense and requires only photographic equipment, which can be embedded into other portable devices [7], [8]

  • Smoothness loss (L_S): with only the pixel-wise loss, the model can predict a reasonable result in unseen scenarios by fine-tuning on sparse ground-truth data, but this remains an approximation of unknown regions, and it is hard to mimic the distinct shapes and edges of true scenes (a sketch of one common formulation appears after this list)

  • Experiment: we describe the detailed setup of MSDC-Net, including the running environment and parameters
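As referenced in the smoothness-loss highlight above, the following is a hedged sketch of a common edge-aware smoothness term used in depth estimation networks; the paper's exact L_S may differ from this formulation.

```python
import torch

def smoothness_loss(depth, image):
    """Hypothetical edge-aware smoothness term: penalize depth gradients,
    down-weighted where the image itself has strong gradients (likely edges).
    A common formulation in the literature; the paper's L_S may differ."""
    # first-order differences of the predicted depth map (N, 1, H, W)
    d_dx = torch.abs(depth[:, :, :, 1:] - depth[:, :, :, :-1])
    d_dy = torch.abs(depth[:, :, 1:, :] - depth[:, :, :-1, :])
    # image gradients, averaged over RGB channels (N, 3, H, W)
    i_dx = torch.mean(torch.abs(image[:, :, :, 1:] - image[:, :, :, :-1]), 1, keepdim=True)
    i_dy = torch.mean(torch.abs(image[:, :, 1:, :] - image[:, :, :-1, :]), 1, keepdim=True)
    # suppress the penalty across strong image edges so shapes stay sharp
    return (d_dx * torch.exp(-i_dx)).mean() + (d_dy * torch.exp(-i_dy)).mean()
```

Weighting the depth gradients by exp(-|image gradient|) relaxes smoothing at probable object boundaries, which addresses exactly the edge-mimicking weakness the highlight mentions.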


Summary

INTRODUCTION

Before deep learning was applied to image processing, classical methods such as max-flow [21], [30], belief propagation [31], Markov Random Field (MRF), Conditional Random Field (CRF) [32], and Semi-Global Matching (SGM) [33] dominated the depth estimation area. The strong ability of deep networks to mimic detailed depth information of a scene enables them to yield comparable results on monocular tasks [40], [41]. Among the state-of-the-art models [42]–[44], most depend on an encoding and decoding module to generate a dense depth map. Dilated convolution layers maintain the resolution of the input while enlarging the receptive field by inserting holes into the convolution kernels. With this advantage, detailed information can be preserved in the encoding module, providing learnable knowledge for the decoding process.
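To illustrate the resolution argument, here is a small hypothetical PyTorch comparison (not the authors' code) between a max-pooling step, which halves the feature map, and a dilated convolution, which keeps the spatial size while widening the context; the input shape is merely a KITTI-like placeholder.

```python
import torch
import torch.nn as nn

# Illustrative comparison: max-pooling halves the spatial resolution,
# while a dilated convolution keeps it intact and still enlarges the
# receptive field by inserting "holes" into the kernel.
x = torch.randn(1, 32, 128, 416)                       # KITTI-like feature map

pooled = nn.MaxPool2d(kernel_size=2)(x)                # details lost by downsampling
dilated = nn.Conv2d(32, 32, kernel_size=3,
                    padding=2, dilation=2)(x)          # same H x W, wider context

print(pooled.shape)    # torch.Size([1, 32, 64, 208])
print(dilated.shape)   # torch.Size([1, 32, 128, 416])
```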

RELATED WORK
REGRESSION OUTPUT MODULE
LOSS FUNCTION
EXPERIMENT
IMPLEMENTATION DETAILS
Findings
CONCLUSION
