TSNet: Token Sparsification for Efficient Video Transformer

Hao Wang, Wenjia Zhang, Guohua Liu

doi:10.3390/app131910633

Open Access

https://doi.org/10.3390/app131910633

Abstract

Video transformers have demonstrated remarkable performance in video recognition, albeit at significant computational cost. This paper introduces TSNet, an approach for dynamically selecting informative tokens from a given video sample. A lightweight prediction module assigns an importance score to each token in the video, and only the tokens with the top scores are used for self-attention computation. We apply the Gumbel-softmax technique to sample from the output of the prediction module, which enables end-to-end optimization of that module. We aim to extend our method to hierarchical vision transformers rather than single-scale vision transformers. A simple linear module projects the pruned tokens, and the projected result is concatenated with the output of the self-attention network, which maintains the same number of tokens while capturing interactions with the selected tokens. Because feedforward networks (FFNs) contribute significant computation, we also propose a linear projection for the pruned tokens to accelerate the model, while the existing FFN layer processes the selected tokens. Finally, to keep the structure of the output unchanged, the two groups of tokens are reassembled according to their spatial positions in the original feature map. The experiments focus primarily on the Kinetics-400 dataset using UniFormer, a hierarchical video transformer backbone that incorporates convolution in its self-attention block. Our model achieves results comparable to the original model while reducing computation by over 13%. Notably, by hierarchically pruning 70% of input tokens, our approach reduces FLOPs by 55.5% while confining the decline in accuracy to 2%. Additional tests with other transformers, such as the Video Swin Transformer, indicate that the method is widely applicable and adaptable and shows promise on video recognition benchmarks. With our token sparsification framework, video vision transformers can achieve a favorable balance between improved computational speed and a slight reduction in accuracy.
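
The select, project, and reassemble steps described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the one-layer linear scorer, the single-head attention, and all names (`sparse_attention_block`, `gumbel_softmax`, `w_score`, `w_qkv`, `w_prune`, `keep_ratio`) are assumptions made for illustration. NumPy has no automatic differentiation, so the Gumbel-softmax function only shows the sampling step, not the end-to-end training it enables.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gumbel_softmax(logits, tau, rng):
    """Sample keep/drop decisions from per-token logits of shape (N, 2).

    Returns the relaxed (soft) sample and its one-hot (hard) version.
    In an autograd framework the hard sample is typically used in the
    forward pass while gradients flow through the soft one.
    """
    u = rng.uniform(low=1e-9, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))                 # Gumbel(0, 1) noise
    soft = softmax((logits + gumbel) / tau)
    hard = np.eye(logits.shape[-1])[soft.argmax(axis=-1)]
    return soft, hard

def sparse_attention_block(x, w_score, w_qkv, w_prune, keep_ratio):
    """x: (N, C) flattened video tokens. Returns (N, C) output and kept indices."""
    n, c = x.shape
    # Assumption: a linear scorer stands in for the lightweight prediction module.
    scores = x @ w_score                         # (N,) importance per token
    k = max(1, int(round(keep_ratio * n)))
    order = np.argsort(-scores)                  # highest score first
    keep_idx = np.sort(order[:k])                # selected tokens
    drop_idx = np.sort(order[k:])                # pruned tokens

    # Self-attention is computed among the selected tokens only.
    q, key, v = np.split(x[keep_idx] @ w_qkv, 3, axis=-1)
    attn = softmax(q @ key.T / np.sqrt(c))

    # Pruned tokens take the cheap linear path; both groups are then
    # written back at their original positions so the layout is unchanged.
    out = np.empty_like(x)
    out[keep_idx] = attn @ v
    out[drop_idx] = x[drop_idx] @ w_prune
    return out, keep_idx

rng = np.random.default_rng(0)
n, c = 16, 8
x = rng.standard_normal((n, c))
w_score = rng.standard_normal(c)
w_qkv = rng.standard_normal((c, 3 * c))
w_prune = rng.standard_normal((c, c))

out, keep_idx = sparse_attention_block(x, w_score, w_qkv, w_prune, keep_ratio=0.3)
print(out.shape, len(keep_idx))                  # (16, 8) 5

soft, hard = gumbel_softmax(rng.standard_normal((n, 2)), tau=1.0, rng=rng)
```

With `keep_ratio=0.3`, attention runs over 5 of the 16 tokens, so the quadratic attention cost shrinks accordingly, while the output keeps all 16 token positions for the next layer. The same split could be applied to the FFN, with the selected tokens going through the FFN and the pruned tokens through a linear projection.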

Journal: Applied Sciences
Publication Date: Sep 24, 2023
Citations: 1
License type: CC BY 4.0
