Abstract

Sparse depth completion generates a dense depth image from a sparse measurement under the guidance of an RGB image. In this paper, we propose an attention-guided sparse depth completion network based on convolutional neural networks, called AGNet. We adopt attention learning to obtain geometric cues for depth regression from the RGB image and to capture multi-scale depth structures. First, we use the RGB image and a valid binary mask derived from the input sparse depth image to generate an initial coarse depth image and its confidence map. Then, we generate an attention map for depth refinement using a cross spatial attention module (CSAM). CSAM separately takes the RGB image with the valid mask and the invalid mask as input to make full use of color information in the attention map. Next, we build a multi-scale learning network to encode the sparse depth image at different scales, leading to accurate depth completion. AGNet exploits the input sparse depth image for encoding coarse features with a moderate model size. Experimental results show that AGNet achieves performance comparable with state-of-the-art depth completion methods on the NYU v2 dataset.
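The abstract's first stage combines the RGB image with a valid binary mask extracted from the sparse depth input. A minimal sketch of that mask extraction and input assembly (the paper's exact input formatting is not specified here, so the channel layout below is an assumption):

```python
import numpy as np

def build_coarse_input(rgb, sparse_depth):
    # Valid binary mask: 1 where the sparse depth image has a
    # measurement (depth > 0), 0 elsewhere.
    valid_mask = (sparse_depth > 0).astype(np.float32)
    # Stack RGB and the mask along the channel axis as the input
    # to the coarse depth/confidence stage.
    return np.concatenate([rgb, valid_mask], axis=0)  # (4, H, W)

rgb = np.random.rand(3, 8, 8).astype(np.float32)
sparse = np.zeros((1, 8, 8), dtype=np.float32)
sparse[0, 2, 3] = 1.5  # a single valid depth measurement
x = build_coarse_input(rgb, sparse)
```

In an actual network this 4-channel tensor would feed a convolutional encoder that regresses both the coarse depth and its confidence map.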

Highlights

  • Accurate depth prediction of an observed scene is a key prerequisite for subsequent vision tasks such as object detection, object recognition, and 3D scene reconstruction

  • The cross spatial attention module takes adjacent RGB pixels and coarse depth features weighted by their confidence map as input, and outputs a spatial attention map as guidance for the multi-scale depth inputs

  • The cross spatial attention takes RGB features as guidance for depth regression, and the multi-scale attention maps are used as training signals for depth estimation
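The highlights describe the cross spatial attention module at a high level: RGB information split by the valid/invalid masks is fused with confidence-weighted coarse depth to produce a per-pixel attention map. The paper summary does not give the module's internals, so the following is only a toy sketch of that data flow; the real CSAM uses learned convolutions where this sketch uses fixed arithmetic:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_spatial_attention(rgb, valid_mask, coarse_depth, confidence):
    """Toy sketch of the CSAM data flow (not the learned module).

    rgb:         (3, H, W) color image
    valid_mask:  (H, W) binary mask of observed sparse-depth pixels
    coarse_depth, confidence: (H, W) outputs of the coarse stage
    Returns a (H, W) spatial attention map with values in (0, 1).
    """
    gray = rgb.mean(axis=0)                     # collapse color channels
    valid_branch = gray * valid_mask            # RGB where depth observed
    invalid_branch = gray * (1.0 - valid_mask)  # RGB where depth missing
    weighted_depth = coarse_depth * confidence  # confidence-weighted depth
    score = valid_branch + invalid_branch + weighted_depth
    return sigmoid(score)                       # per-pixel attention

H = W = 8
rgb = np.random.rand(3, H, W)
valid = (np.random.rand(H, W) > 0.9).astype(np.float64)
depth = np.random.rand(H, W)
conf = np.random.rand(H, W)
att = cross_spatial_attention(rgb, valid, depth, conf)
```

The two-branch split over valid and invalid masks mirrors the abstract's statement that CSAM processes masked RGB separately to exploit color information in regions with and without sparse measurements.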


Summary

INTRODUCTION

Accurate depth prediction of an observed scene is a key prerequisite for subsequent vision tasks such as object detection, object recognition, and 3D scene reconstruction. We propose attention-guided sparse depth completion using CNNs. We adopt attention learning to effectively learn the pixel/position relationships of the color image and then predict accurate depth values guided by the attention map. We build a cross spatial attention module to encode the contextual information of the RGB image and generate an attention map for multi-scale depth prediction. We use a network structure similar to that proposed by Chang et al. [28]. Unlike that network, AGNet uses average pooling instead of max pooling, because average pooling better retains background information for RGB-guided depth completion. The cross spatial attention module takes adjacent RGB pixels and coarse depth features weighted by their confidence map as input, and outputs a spatial attention map as guidance for the multi-scale depth inputs. We use multi-scale learning to perform dense depth prediction and to bridge the modal difference between RGB and depth images. We obtain attention maps at one-eighth, quarter, and half size, corresponding to the coarse depth features.
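The introduction mentions two concrete design points: average pooling is preferred over max pooling, and attention maps are produced at half, quarter, and one-eighth resolution. A minimal sketch of generating those multi-scale maps by repeated 2x2 average pooling (the actual network uses learned downsampling layers, so this is illustrative only):

```python
import numpy as np

def avg_pool2x(x):
    # 2x2 average pooling; AGNet reportedly prefers average over max
    # pooling because it better retains background information for
    # RGB-guided depth completion.
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def multiscale_maps(attention):
    """Half-, quarter-, and one-eighth-size attention maps, matching
    the scales mentioned in the text. Sketch only."""
    half = avg_pool2x(attention)
    quarter = avg_pool2x(half)
    eighth = avg_pool2x(quarter)
    return half, quarter, eighth

att = np.random.rand(32, 32)
half, quarter, eighth = multiscale_maps(att)
```

Unlike max pooling, which keeps only the strongest local response, average pooling preserves the mean activation of each region, so low-contrast background structure still contributes to the downsampled attention maps.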

MULTI-SCALE LEARNING MODULE
NYU v2 DATASET AND EVALUATION METRICS
COMPARISON WITH STATE-OF-THE-ARTS
ABLATION STUDIES

We conduct ablation studies on the network architecture.
CONCLUSION
