Abstract

Vision Transformers (ViTs) have shown superior performance on various visual tasks, owing to the ability of self-attention to model long-range dependencies. Some recent works reduce the high computational cost of vision transformers by restricting self-attention to local windows. As a price, window-based self-attention also weakens the ability to capture long-range dependencies compared with the original self-attention in transformers. In this paper, we propose a Local and Global Vision Transformer (LGViT) that incorporates overlapping windows and multi-scale dilated pooling to strengthen self-attention both locally and globally. The proposed self-attention mechanism is composed of a local self-attention (LSA) module and a global self-attention (GSA) module, both performed on overlapping windows partitioned from the input image. In LSA, the key and value sets are expanded to the surroundings of each window to enlarge the receptive field. In GSA, the key and value sets are expanded by multi-scale dilated pooling to promote global interactions. Moreover, a dynamic contextual positional encoding module is employed to add positional information more efficiently and flexibly. We conduct extensive experiments on various visual tasks, and the results demonstrate that the proposed LGViT outperforms state-of-the-art approaches.
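To make the key/value expansion concrete, below is a minimal, single-head PyTorch sketch of the mechanism the abstract describes. It is an illustration under stated assumptions, not the authors' implementation: the class name LocalGlobalAttention, the window/expand/pool_sizes parameters, and the use of adaptive average pooling standing in for the paper's multi-scale dilated pooling are all hypothetical choices.

```python
# Minimal single-head sketch of the local/global key-value expansion
# described above. Assumptions: class and parameter names are hypothetical,
# and adaptive average pooling stands in for the paper's multi-scale
# dilated pooling; this is not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalGlobalAttention(nn.Module):
    def __init__(self, dim, window=7, expand=2, pool_sizes=(1, 2, 4)):
        super().__init__()
        self.window, self.expand = window, expand
        self.pool_sizes = pool_sizes              # output sizes for the global branch
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.scale = dim ** -0.5

    def forward(self, x):                         # x: (B, C, H, W), H and W divisible by window
        B, C, H, W = x.shape
        w, e = self.window, self.expand
        # Queries: non-overlapping w x w windows.
        q = x.unfold(2, w, w).unfold(3, w, w)                     # (B, C, H/w, W/w, w, w)
        q = self.q(q.permute(0, 2, 3, 4, 5, 1).reshape(B, -1, w * w, C))
        # Local K/V: each window expanded by its surroundings, so that
        # neighbouring key/value windows overlap (larger receptive field).
        k = w + 2 * e                                             # expanded window size
        local = F.unfold(x, kernel_size=k, stride=w, padding=e)   # (B, C*k*k, nW)
        nW = local.shape[-1]
        local = local.view(B, C, k * k, nW).permute(0, 3, 2, 1)   # (B, nW, k*k, C)
        # Global K/V: multi-scale pooled summaries of the whole feature map,
        # shared by every window to promote global interactions.
        pools = [F.adaptive_avg_pool2d(x, s).flatten(2).transpose(1, 2)
                 for s in self.pool_sizes]                        # each (B, s*s, C)
        glob = torch.cat(pools, 1).unsqueeze(1).expand(B, nW, -1, C)
        kk, vv = self.kv(torch.cat([local, glob], 2)).chunk(2, -1)
        attn = (q @ kk.transpose(-2, -1) * self.scale).softmax(-1)
        return attn @ vv                                          # (B, nW, w*w, C), windowed output

attn = LocalGlobalAttention(dim=96, window=7, expand=2)
y = attn(torch.randn(2, 96, 56, 56))              # (2, 64, 49, 96): 64 windows of 49 tokens
```

In a full block the windowed outputs would be folded back into a (B, C, H, W) feature map; the sketch stops at the attention step to keep the local and global key/value expansion visible.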
