Abstract
Vision Transformer (ViT) achieves excellent accuracy in image recognition and has been actively studied across many fields. However, ViT relies on attention, which requires large, computationally expensive matrix multiplications. Because the cost of the self-attention used in ViT grows quadratically with the number of tokens, reducing this cost by pruning tokens has become an active research topic in recent years. Token pruning requires setting a pruning rate, and in many studies this rate is set manually. However, the appropriate pruning rate varies from task to task, making it difficult to determine manually. In this study, we propose a method to solve this problem: a pruning rate adjustment that tunes the pruning rate via Gradient-Aware Scaling (GAS) so that the training loss converges. In addition, we propose Variable Proportional Attention (VPA) for Top-K, a general-purpose token pruning method, to mitigate the performance loss caused by pruning. On the CIFAR-10 dataset, several competitive pruning methods achieve higher recognition accuracy with our adjustment than with manually set pruning rates; eTPS+Adjust on Hybrid ViT-S achieves 99.01% accuracy with -31.68% FLOPs. Furthermore, for inference with a trained ViT-L on ImageNet-1k, Top-K+VPA outperforms token merging at large pruning rates and scales better in the accuracy-latency trade-off. In particular, Top-K+VPA applied to ViT-L in a GPU environment with a pruning rate of 6% achieves 80.62% accuracy on ImageNet-1k with -50.44% FLOPs and -46.8% latency.
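As a rough illustration of the Top-K token pruning referenced above, the following minimal sketch keeps only the highest-scoring patch tokens between transformer blocks. It is not the paper's exact implementation: the function name, tensor shapes, and the use of CLS-token attention as the importance score are illustrative assumptions.

```python
import torch

def topk_token_pruning(tokens: torch.Tensor,
                       cls_attn: torch.Tensor,
                       keep_ratio: float) -> torch.Tensor:
    """Keep the top-K patch tokens ranked by CLS attention (illustrative sketch).

    tokens:     (B, N, D) token embeddings; index 0 is the CLS token.
    cls_attn:   (B, N-1) attention from the CLS token to each patch token,
                e.g. averaged over heads (an assumed scoring choice).
    keep_ratio: fraction of patch tokens to keep, i.e. 1 - pruning rate.
    """
    B, N, D = tokens.shape
    k = max(1, int((N - 1) * keep_ratio))
    # Indices of the k highest-scoring patch tokens per batch element.
    idx = cls_attn.topk(k, dim=1).indices                        # (B, k)
    patch_tokens = tokens[:, 1:, :]
    kept = patch_tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, D))
    # Re-attach the CLS token in front of the kept patch tokens.
    return torch.cat([tokens[:, :1, :], kept], dim=1)

# Example: keep 70% of patch tokens (pruning rate 30%).
# pruned = topk_token_pruning(tokens, cls_attn, keep_ratio=0.7)
```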