Abstract

Monocular depth estimation is a fundamental task in computer vision with a wide range of applications in robot navigation, autonomous driving, and related fields, and it has received extensive attention from researchers in recent years. For a long time, monocular depth estimation has been based on convolutional neural networks, whose inherent convolution operation is limited in modeling long-range dependencies. Replacing convolutional neural networks with Transformers is a promising alternative, but standard Transformers suffer from high computational complexity and a large number of parameters. To address these problems, we propose Swin-Depth, a Transformer-based monocular depth estimation method that learns hierarchical representations of images with computational complexity linear in image size. In addition, Swin-Depth includes a multi-scale fusion attention module that strengthens the network's ability to capture global information. Our method effectively reduces the excessive parameter count of Transformer-based monocular depth estimation, and extensive experiments show that Swin-Depth achieves state-of-the-art performance on challenging indoor and outdoor datasets.
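The abstract does not spell out the mechanism behind "hierarchical representation learning with linear complexity," but this phrasing matches the window-based self-attention popularized by the Swin Transformer, on which Swin-Depth is built. The PyTorch sketch below illustrates that idea only; the class name `WindowAttention` and all shapes and hyperparameters are illustrative assumptions, not taken from the Swin-Depth implementation.

```python
# Minimal sketch of Swin-style window attention: self-attention is computed
# inside fixed-size local windows rather than across all image tokens, so the
# cost grows linearly with image area instead of quadratically.
# All names and sizes here are illustrative, not from the Swin-Depth paper.
import torch
import torch.nn as nn

class WindowAttention(nn.Module):
    def __init__(self, dim: int, window_size: int, num_heads: int):
        super().__init__()
        self.window_size = window_size
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) feature map; H and W must be multiples of window_size.
        B, H, W, C = x.shape
        ws = self.window_size
        # Partition into non-overlapping ws x ws windows -> (B*num_windows, ws*ws, C).
        x = x.view(B, H // ws, ws, W // ws, ws, C)
        windows = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

        # Standard multi-head self-attention, restricted to each window.
        Bn, N, _ = windows.shape
        qkv = self.qkv(windows).reshape(Bn, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (Bn, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(Bn, N, C)
        out = self.proj(out)

        # Reverse the window partition back to (B, H, W, C).
        out = out.view(B, H // ws, W // ws, ws, ws, C)
        return out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)

# Example: a 64x64 feature map with 8x8 windows and 96 channels.
feat = torch.randn(1, 64, 64, 96)
attn = WindowAttention(dim=96, window_size=8, num_heads=4)
print(attn(feat).shape)  # torch.Size([1, 64, 64, 96])
```

Because each window has a fixed token count, the per-window attention cost is constant and the total cost scales with the number of windows, i.e., linearly with image area, which is the source of the linear complexity the abstract claims.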
