Abstract

Recent studies indicate that hierarchical Vision Transformers (ViTs) with a macro architecture of interleaved non-overlapping window-based self-attention and shifted-window operations can achieve state-of-the-art performance in various visual recognition tasks, and challenge the ubiquitous convolutional neural networks (CNNs) that use densely slid kernels. In most recently proposed hierarchical ViTs, self-attention is the de facto standard for spatial information aggregation. In this paper, we question whether self-attention is the only choice for a hierarchical ViT to attain strong performance, and study the effects of different kinds of cross-window communication methods. To this end, we replace the self-attention layers with embarrassingly simple linear mapping layers, and the resulting proof-of-concept architecture, termed TransLinear, achieves very strong performance on ImageNet-1K image recognition. Moreover, we find that TransLinear is able to leverage ImageNet pre-trained weights and demonstrates competitive transfer learning properties on downstream dense prediction tasks such as object detection and instance segmentation. We also experiment with other alternatives to self-attention for content aggregation inside each non-overlapping window under different cross-window communication approaches. Our results reveal that the macro architecture, rather than the specific aggregation layers or cross-window communication mechanisms, is more responsible for the hierarchical ViT's strong performance, and is the real challenger to the ubiquitous CNN's dense sliding-window paradigm.
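To make the core idea concrete, below is a minimal PyTorch sketch (not the authors' released code) of one such block: the window self-attention layer is swapped for a plain linear mapping over the tokens of each non-overlapping window, while cross-window communication comes from the usual cyclic shift. The class name `WindowLinearBlock`, the `token_mix` layer, and all hyper-parameters are illustrative assumptions; Swin-style details such as shifted-window attention masking and relative position bias are omitted, since they are tied to attention rather than to the macro architecture.

```python
import torch
import torch.nn as nn


class WindowLinearBlock(nn.Module):
    """Hypothetical sketch: a hierarchical-ViT block whose window
    self-attention is replaced by a simple linear mapping over the
    tokens inside each non-overlapping window."""

    def __init__(self, dim, window_size=7, shift=0, mlp_ratio=4):
        super().__init__()
        self.window_size = window_size
        self.shift = shift  # 0 for regular windows, window_size // 2 for shifted ones
        self.norm1 = nn.LayerNorm(dim)
        # Token-mixing replacement for self-attention: a dense mapping over
        # the window_size * window_size token positions of each window.
        n = window_size * window_size
        self.token_mix = nn.Linear(n, n)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x):  # x: (B, H, W, C), with H and W divisible by window_size
        B, H, W, C = x.shape
        ws = self.window_size
        shortcut = x
        x = self.norm1(x)
        if self.shift:  # shifted-window cross-window communication
            x = torch.roll(x, shifts=(-self.shift, -self.shift), dims=(1, 2))
        # Partition into non-overlapping windows: (B * num_windows, ws*ws, C).
        x = x.reshape(B, H // ws, ws, W // ws, ws, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)
        # The linear mapping mixes the ws*ws tokens of every window, per channel.
        x = self.token_mix(x.transpose(1, 2)).transpose(1, 2)
        # Reverse the window partition back to (B, H, W, C).
        x = x.reshape(B, H // ws, W // ws, ws, ws, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
        if self.shift:
            x = torch.roll(x, shifts=(self.shift, self.shift), dims=(1, 2))
        x = shortcut + x
        x = x + self.mlp(self.norm2(x))
        return x


# Example usage: alternate regular and shifted blocks, Swin-style.
blocks = nn.Sequential(
    WindowLinearBlock(dim=96, window_size=7, shift=0),
    WindowLinearBlock(dim=96, window_size=7, shift=3),
)
out = blocks(torch.randn(2, 56, 56, 96))  # -> (2, 56, 56, 96)
```

Note that, unlike attention, the linear mapping needs no masking after the cyclic shift; this reflects the abstract's point that the interleaved window/shifted-window macro structure, not the specific aggregation layer, carries much of the design.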

