Abstract

In recent years, self-supervised monocular depth estimation has gained popularity among researchers because it uses only a single camera at a much lower cost than the direct use of laser sensors to acquire depth. Although monocular self-supervised methods can obtain dense depths, the estimation accuracy needs to be further improved for better applications in scenarios such as autonomous driving and robot perception. In this paper, we innovatively combine soft attention and hard attention with two new ideas to improve self-supervised monocular depth estimation: (1) a soft attention module and (2) a hard attention strategy. We integrate the soft attention module in the model architecture to enhance feature extraction in both spatial and channel dimensions, adding only a small number of parameters. Unlike traditional fusion approaches, we use the hard attention strategy to enhance the fusion of generated multi-scale depth predictions. Further experiments demonstrate that our method can achieve the best self-supervised performance both on the standard KITTI benchmark and the Make3D dataset.

Highlights

  • It is essentially an ill-posed problem to obtain depth from an RGB image, recent studies have proved that convolutional neural networks (CNNs) can estimate depth by learning the hidden depth clues from the RGB images

  • As with most previous papers, we used the KITTI benchmark for training and evaluation

  • Since self-supervised training based on monocular video has no scale information, we have reported the results using the per-image median ground-truth scaling [7]

Read more

Summary

Introduction

There are various methods of obtaining depth information, such as laser sensors, or binocular or multi-cameras. These might be not available in some cases due to high costs or high environmental demands. Monocular images or videos are more common in most scenes This has led to the rapid development of monocular depth estimation as only one single camera is required. Self-supervised learning methods [7,8,9,10] construct new supervised signals by using depth predictions as intermediate variables through geometric constraints of monocular video or stereo images

Methods
Results
Conclusion
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.