Abstract

Estimating depth from a single image is challenging because the same 2D image can be the projection of many different 3D scenes. Most deep-learning-based depth prediction methods extract features with small convolutional kernels whose receptive fields are limited, which leads to deformed depth edges and inaccurate depth values for distant objects. To address this problem, we propose a depth estimation network based on the Swin Transformer and an encoder-decoder structure. The encoder is constructed from the Swin Transformer, which can encode long-range spatial dependencies and extract features at multiple scales and across different channels. The decoder fuses the encoder features through interpolation, concatenation, and convolution. Experiments on the KITTI and NYUv2 datasets show that the proposed network produces more accurate depth edges and depth values than state-of-the-art methods.
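To make the decoder's fusion step concrete, below is a minimal sketch (PyTorch assumed; this is not the authors' released code) of a decoder that repeatedly upsamples the coarser feature map by bilinear interpolation, concatenates it with the encoder feature at the same scale, and applies convolutions before a final depth head. The channel widths (96, 192, 384, 768) follow the typical Swin-T hierarchy; all class and variable names here are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionBlock(nn.Module):
    """Upsample -> concatenate skip feature -> convolve."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, out_ch, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        # Bilinear interpolation to match the skip feature's resolution.
        x = F.interpolate(x, size=skip.shape[-2:], mode="bilinear",
                          align_corners=False)
        return self.conv(torch.cat([x, skip], dim=1))

class DepthDecoder(nn.Module):
    """Fuses multi-scale encoder features into a single depth map."""
    def __init__(self, enc_chs=(96, 192, 384, 768)):  # assumed Swin-T widths
        super().__init__()
        c1, c2, c3, c4 = enc_chs
        self.fuse3 = FusionBlock(c4, c3, c3)
        self.fuse2 = FusionBlock(c3, c2, c2)
        self.fuse1 = FusionBlock(c2, c1, c1)
        self.head = nn.Conv2d(c1, 1, 3, padding=1)  # per-pixel depth

    def forward(self, feats):
        f1, f2, f3, f4 = feats  # fine -> coarse, from the Swin encoder
        x = self.fuse3(f4, f3)
        x = self.fuse2(x, f2)
        x = self.fuse1(x, f1)
        return self.head(x)

# Shape check with dummy features at strides 4/8/16/32 of a 224x224 input.
feats = [torch.randn(1, c, s, s) for c, s in
         zip((96, 192, 384, 768), (56, 28, 14, 7))]
print(DepthDecoder()(feats).shape)  # torch.Size([1, 1, 56, 56])
```

The output depth map here sits at 1/4 of the input resolution; a final interpolation to full resolution is a common design choice in such decoders.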
