Abstract

In a wide range of dense prediction tasks, large-scale Vision Transformers achieve state-of-the-art performance at the cost of expensive computation. In contrast to most existing approaches, which accelerate Vision Transformers for image classification, we focus on accelerating Vision Transformers for dense prediction without any fine-tuning. We present two non-parametric operators specialized for dense prediction: a token clustering layer that decreases the number of tokens to expedite inference, and a token reconstruction layer that increases the number of tokens to recover high-resolution representations. The approach proceeds in three steps: i) a token clustering layer clusters neighboring tokens and yields low-resolution representations that preserve spatial structure; ii) the subsequent transformer layers are applied only to these clustered low-resolution tokens; and iii) a token reconstruction layer recovers high-resolution representations from the refined low-resolution ones. The proposed approach shows consistently promising results on six dense prediction tasks, including object detection, semantic segmentation, panoptic segmentation, instance segmentation, depth estimation, and video instance segmentation. Additionally, we validate the effectiveness of the proposed approach on very recent state-of-the-art open-vocabulary recognition methods. Furthermore, a number of recent representative approaches are benchmarked and compared on dense prediction tasks.
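The three-step pipeline above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the clustering here is a simple spatial average pooling stand-in, and the reconstruction mixes refined low-resolution tokens using softmax-normalized cosine similarities between the original high-resolution tokens and the unrefined low-resolution tokens. Both operators are non-parametric, matching the property the abstract emphasizes; all function names and the temperature `tau` are illustrative assumptions.

```python
import numpy as np

def token_clustering(tokens, h, w, stride=2):
    """Cluster neighboring tokens by average-pooling the (h, w) token grid.

    tokens: (h*w, d) high-resolution tokens.
    Returns ((h//stride)*(w//stride), d) low-resolution tokens.
    Illustrative stand-in for the paper's token clustering layer.
    """
    d = tokens.shape[1]
    grid = tokens.reshape(h, w, d)
    hl, wl = h // stride, w // stride
    pooled = (grid[:hl * stride, :wl * stride]
              .reshape(hl, stride, wl, stride, d)
              .mean(axis=(1, 3)))
    return pooled.reshape(hl * wl, d)

def token_reconstruction(hi_tokens, lo_refined, lo_tokens, tau=1.0):
    """Recover high-resolution tokens from refined low-resolution ones.

    Each original high-res token is matched against the *unrefined*
    low-res tokens by cosine similarity; the resulting weights then mix
    the *refined* low-res tokens back to high resolution (non-parametric).
    """
    hn = hi_tokens / (np.linalg.norm(hi_tokens, axis=1, keepdims=True) + 1e-8)
    ln = lo_tokens / (np.linalg.norm(lo_tokens, axis=1, keepdims=True) + 1e-8)
    sim = hn @ ln.T / tau                            # (N_hi, N_lo)
    w = np.exp(sim - sim.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                # softmax over low-res tokens
    return w @ lo_refined                            # (N_hi, d)
```

In use, the remaining transformer layers would run on the clustered tokens between the two calls; here that stage is abstract, so any `(N_lo, d)` array of refined tokens can be reconstructed back to the original resolution.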

