Multi-tailed vision transformer for efficient inference

Yunke Wang, Bo Du, Wenyuan Wang, Chang Xu

doi:10.1016/j.neunet.2024.106235

Abstract

Recently, Vision Transformer (ViT) has achieved promising performance in image recognition and has gradually come to serve as a powerful backbone in various vision tasks. To satisfy the sequential input of the Transformer, the tail of ViT first splits each image into a sequence of visual tokens with a fixed length. The following self-attention layers then model the global relationships between tokens to produce useful representations for downstream tasks. Empirically, representing the image with more tokens leads to better performance, yet the computational complexity of the self-attention layer is quadratic in the number of tokens, which can seriously reduce the efficiency of ViT’s inference. To reduce computation, a few pruning methods progressively prune uninformative tokens inside the Transformer encoder while leaving the number of tokens fed into the Transformer untouched. In fact, feeding fewer tokens into the Transformer encoder directly reduces the computational cost of every subsequent layer. In this spirit, we propose a Multi-Tailed Vision Transformer (MT-ViT) in this paper. MT-ViT adopts multiple tails to produce visual sequences of different lengths for the following Transformer encoder. A tail predictor is introduced to decide which tail is the most efficient one for a given image while still producing an accurate prediction. Both modules are optimized in an end-to-end fashion with the Gumbel-Softmax trick. Experiments on ImageNet-1K demonstrate that MT-ViT achieves a significant reduction in FLOPs with no degradation in accuracy and outperforms the compared methods in both accuracy and FLOPs.
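The two ideas in the abstract, that fewer input tokens shrink the quadratic self-attention cost and that a predictor picks one tail per image via Gumbel-Softmax, can be illustrated with a small standard-library sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the 224-pixel input, the two patch sizes, the embedding width of 384, the predictor logits, and all function names.

```python
import math
import random

def num_tokens(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches (visual tokens) per image."""
    return (image_size // patch_size) ** 2

def attention_cost(n_tokens: int, dim: int) -> int:
    """Rough multiply-accumulate count for the two n-by-n matrix products
    (QK^T and attention-times-V) in one self-attention layer."""
    return 2 * n_tokens * n_tokens * dim

def gumbel_softmax(logits, tau=1.0, rng=random):
    """Relaxed one-hot sample over tails: add Gumbel noise, then softmax."""
    # Clamp the uniform sample away from 0 so the double log is defined.
    gumbels = [-math.log(-math.log(max(rng.random(), 1e-12))) for _ in logits]
    scores = [(l + g) / tau for l, g in zip(logits, gumbels)]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical tails: each one tokenizes the image with a different patch size.
PATCH_SIZES = (16, 32)
tokens = [num_tokens(224, p) for p in PATCH_SIZES]
costs = [attention_cost(n, 384) for n in tokens]

# Hypothetical tail-predictor logits for one image; a seeded generator
# keeps the sample reproducible.
weights = gumbel_softmax([0.2, 1.5], tau=1.0, rng=random.Random(0))
chosen = max(range(len(weights)), key=weights.__getitem__)

print("tokens per tail:", tokens)
print("attention cost ratio:", costs[0] // costs[1])
print("selected tail index:", chosen)
```

Under these assumed settings the coarser tail yields a quarter of the tokens and therefore one sixteenth of the attention cost per layer. In a trainable model the relaxed weights would typically be paired with a hard selection in the forward pass so that gradients can still reach the predictor; this sketch only shows the sampling step.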

More From: Neural networks : the official journal of the International Neural Network Society

Similar Papers

Research on Ultrasonic Image Recognition Based on Optimization Immune Algorithm.
Xueqiang Zeng ... Sufen Chen
Computational and mathematical methods in medicine | VOL. 2021
17 May 2021

Reservoir Computing with Untrained Convolutional Neural Networks for Image Recognition
Zhiqiang Tong ... Gouhei Tanaka
01 Aug 2018

A method based on the Levenshtein distance metric for the comparison of multiple movement patterns described by matrix sequences of different length
Jasper Beernaerts ... Nico Van De Weghe
Expert Systems with Applications | VOL. 115
10 Aug 2018

Complete Complementary Sequences of Different Length
R.S Raja Durai ... C Han
IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences | VOL. E90-A
01 Jul 2007
