MobileDepth: Monocular Depth Estimation Based on Lightweight Vision Transformer

Yundong Li, Xiaokun Wei

doi:10.1080/08839514.2024.2364159

Abstract

With the rise of deep learning, monocular depth estimation based on convolutional neural networks (CNNs) has made impressive progress. CNNs excel at extracting local features from a single image; however, they cannot model long-range dependencies, which substantially limits the performance of monocular depth estimation. In addition, because CNN-based architectures frequently use downsampling operations, many pixel-level features, which are crucial for dense prediction tasks, are lost in the encoder stage. Unlike CNNs, the Vision Transformer (ViT) can capture global feature information, but it requires a large number of parameters and substantial data augmentation owing to its lack of inductive bias. To address these difficulties, we propose a Dilated Self Attention Block (DSAB) and a Local and Global Feature Extraction (LGFE) module. The DSAB addresses the inference speed issue of standard ViT models by limiting the number of self-attention computations among tokens. The LGFE module combines the advantages of CNNs and ViT: it first extracts local representations in a low-dimensional space through standard convolution and then maps the input tensor to a high-dimensional space to capture global information, thereby extracting global and local features simultaneously.
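
The abstract does not say how the DSAB chooses which tokens attend to each other, so the following is only a minimal sketch of one way to limit self-attention computations among tokens. It assumes a 1-D token sequence, identity query/key/value projections, and a strided grouping in which each token attends only to tokens sharing its index modulo a dilation rate. The names `attention`, `dilated_attention`, `pair_count`, and `dilation` are hypothetical and do not come from the paper.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(tokens):
    """Plain single-head self-attention (assumption: identity Q/K/V projections)."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        # Scaled dot-product score of this query against every token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(dim)])
    return out


def dilated_attention(tokens, dilation):
    """Assumed scheme: attend only within groups of tokens `dilation` apart."""
    out = [None] * len(tokens)
    for offset in range(dilation):
        idx = list(range(offset, len(tokens), dilation))
        group_out = attention([tokens[i] for i in idx])
        for i, o in zip(idx, group_out):
            out[i] = o
    return out


def pair_count(n_tokens, dilation=1):
    """Number of query-key pairs scored under the grouping above."""
    return sum(len(range(o, n_tokens, dilation)) ** 2 for o in range(dilation))


if __name__ == "__main__":
    print(pair_count(64))     # full attention: 64 * 64 = 4096 pairs
    print(pair_count(64, 4))  # four groups of 16 tokens: 4 * 16 * 16 = 1024 pairs
```

Under this assumed grouping, N tokens with dilation rate d need roughly N²/d score computations instead of N², which is the kind of saving the abstract attributes to the DSAB. The actual block in the paper may select tokens differently.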

Journal: Applied Artificial Intelligence
Publication Date: Jul 1, 2024
License type: CC BY-NC 4.0