SAVE: Encoding spatial interactions for vision transformers

Xiao Ma, Zetian Zhang, Rong Yu, Zexuan Ji, Mingchao Li, Yuhan Zhang, Qiang Chen

doi:10.1016/j.imavis.2024.105312

Abstract

Transformers have achieved impressive performance in visual tasks. Position encoding, which equips vectors (elements of input tokens, queries, keys, or values) with sequence specificity, compensates for the order insensitivity of self-attention in transformers. In this work, we first clarify that both position encoding and additional position-specific operations introduce positional information when they participate in self-attention. On this basis, most existing position encoding methods are equivalent to special affine transformations. However, such encodings do not capture interactions between the contents of different vectors. We therefore propose Spatial Aggregation Vector Encoding (SAVE), which employs transition matrices to recombine vectors. We design two simple yet effective modes for merging other vectors, with each vector serving as an anchor. The aggregated vectors control spatial contextual connections by establishing two-dimensional relationships. SAVE is plug-and-play in vision transformers and can be combined with other position encoding methods. Comparative results on three image classification datasets show that SAVE performs comparably to current position encoding methods. Experiments on detection tasks show that SAVE improves the downstream performance of transformer-based methods. Code is available at https://github.com/maxiao0234/SAVE.
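
To make the idea of recombining vectors with a transition matrix more concrete, the following is a minimal sketch and not the SAVE formulation itself. It assumes a hypothetical weighting rule (weights that decay with Manhattan distance on the token grid) and hypothetical helper names (`grid_transition_matrix`, `recombine`); the two aggregation modes described in the abstract are not reproduced here. It only illustrates how a row-normalized matrix built from two-dimensional positions can mix the contents of other vectors into each anchor vector.

```python
# Illustrative sketch only. The distance-based weights are an assumption,
# not the weighting used by SAVE.

def grid_transition_matrix(height, width, decay=1.0):
    """Build a row-stochastic matrix T for an H x W token grid.

    T[i][j] is the weight of token j when aggregating around anchor i.
    Weights shrink with Manhattan distance on the 2D grid (hypothetical rule).
    """
    coords = [(r, c) for r in range(height) for c in range(width)]
    matrix = []
    for anchor_r, anchor_c in coords:
        raw = [
            1.0 / (1.0 + decay * (abs(anchor_r - r) + abs(anchor_c - c)))
            for r, c in coords
        ]
        total = sum(raw)
        matrix.append([w / total for w in raw])  # normalize each row to sum to 1
    return matrix


def recombine(vectors, transition):
    """Replace each vector by a weighted mix of all vectors (out = T @ V)."""
    dim = len(vectors[0])
    return [
        [sum(row[j] * vectors[j][d] for j in range(len(vectors))) for d in range(dim)]
        for row in transition
    ]


# A 2 x 2 grid of 3-dimensional token vectors, flattened row by row.
tokens = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]
T = grid_transition_matrix(2, 2)
aggregated = recombine(tokens, T)
```

Because the weights depend on where each token sits on the grid, rearranging the tokens spatially changes the aggregated vectors, which is the sense in which such an operation carries positional information while also mixing vector content.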

More From: Image and Vision Computing
