Abstract

Estimating depth from a single image is challenging because the same 2D image can be the projection of many different 3D scenes. Most deep-learning-based depth prediction methods extract features with small convolutional kernels whose receptive fields are limited, which leads to deformed depth edges and inaccurate depth values for distant objects. To address this problem, we propose a depth estimation network based on the Swin Transformer and an encoder-decoder structure. The encoder is constructed from the Swin Transformer, which can encode long-range spatial dependencies and extract features at multiple scales and across different channels. The decoder fuses the encoder features through interpolation, concatenation, and convolution. Experiments on the KITTI and NYUv2 datasets show that the proposed network produces more accurate depth edges and depth values than state-of-the-art methods.
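To make the decoder's fusion step concrete, below is a minimal sketch (PyTorch assumed; this is not the authors' released code) of a decoder that repeatedly upsamples the coarser feature map by bilinear interpolation, concatenates it with the encoder feature at the same scale, and applies convolutions before a final depth head. The channel widths (96, 192, 384, 768) follow the typical Swin-T hierarchy; all class and variable names here are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionBlock(nn.Module):
    """Upsample -> concatenate skip feature -> convolve."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, out_ch, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        # Bilinear interpolation to match the skip feature's resolution.
        x = F.interpolate(x, size=skip.shape[-2:], mode="bilinear",
                          align_corners=False)
        return self.conv(torch.cat([x, skip], dim=1))

class DepthDecoder(nn.Module):
    """Fuses multi-scale encoder features into a single depth map."""
    def __init__(self, enc_chs=(96, 192, 384, 768)):  # assumed Swin-T widths
        super().__init__()
        c1, c2, c3, c4 = enc_chs
        self.fuse3 = FusionBlock(c4, c3, c3)
        self.fuse2 = FusionBlock(c3, c2, c2)
        self.fuse1 = FusionBlock(c2, c1, c1)
        self.head = nn.Conv2d(c1, 1, 3, padding=1)  # per-pixel depth

    def forward(self, feats):
        f1, f2, f3, f4 = feats  # fine -> coarse, from the Swin encoder
        x = self.fuse3(f4, f3)
        x = self.fuse2(x, f2)
        x = self.fuse1(x, f1)
        return self.head(x)

# Shape check with dummy features at strides 4/8/16/32 of a 224x224 input.
feats = [torch.randn(1, c, s, s) for c, s in
         zip((96, 192, 384, 768), (56, 28, 14, 7))]
print(DepthDecoder()(feats).shape)  # torch.Size([1, 1, 56, 56])
```

The output depth map here sits at 1/4 of the input resolution; a final interpolation to full resolution is a common design choice in such decoders.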
