Abstract

Recently, transformers have achieved remarkable performance on various computer vision tasks. Unfortunately, vision transformers suffer from the high computational cost of calculating the pair-wise relations among image patches, and this computation grows quadratically with the number of patches, which blocks their deployment on resource-limited devices such as mobile phones and various IoT devices. In this paper, we find that there is a large amount of redundant computation and communication in the self-attention operation of vision transformers due to the high degree of patch locality, and we present a hardware–software co-designed solution that exploits this patch locality, termed DiVIT. DiVIT substantially reduces redundant computation and communication in vision transformers and improves performance and energy efficiency. Experiments demonstrate that DiVIT achieves average speedups of 8.2× and 41.3×, and over three orders of magnitude improvement in energy efficiency, over CPUs and GPUs on real-world datasets.
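For context on the quadratic cost mentioned above, standard self-attention forms an N×N score matrix over the N image patches, so both work and memory scale as O(N²). The sketch below is a minimal, generic single-head self-attention in NumPy written for illustration; it is not DiVIT's optimized kernel, and the weight matrices are random stand-ins for learned parameters.

```python
import numpy as np

def single_head_attention(patches, d_k=64, seed=0):
    """Generic single-head self-attention over N patch embeddings.

    patches: (N, d_model) array of patch embeddings.
    The (N, N) score matrix is what makes the cost grow
    quadratically with the number of patches N.
    """
    rng = np.random.default_rng(seed)
    d_model = patches.shape[1]
    # Random projections stand in for learned Q/K/V weights.
    W_q = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
    W_k = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
    W_v = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)

    Q, K, V = patches @ W_q, patches @ W_k, patches @ W_v
    scores = Q @ K.T / np.sqrt(d_k)           # (N, N): O(N^2) work and memory
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                         # (N, d_k)

# A 224x224 image with 16x16 patches yields N = 196 patches;
# doubling the image resolution quadruples N and roughly 16x the attention cost.
x = np.random.default_rng(1).standard_normal((196, 768))
out = single_head_attention(x)
print(out.shape)  # (196, 64)
```

Because neighboring patches of natural images are often highly similar, many entries of the N×N score matrix carry near-duplicate information; this is the patch-locality redundancy that DiVIT targets.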
