Abstract

Recently, vision Transformers (ViTs) have achieved remarkable progress in image classification. Because the computational cost of the self-attention used in ViTs is quadratic in the number of input tokens, window-based ViTs have been proposed to alleviate this issue. However, these methods restrict self-attention to spatially constrained local windows, losing the ability to encode global interactions across the image. In addition, a fixed window size yields only single-scale representations, which are ill-suited to recognizing objects of varying scales. To address these problems, this paper describes a Pyramid Window-based Lightweight Transformer, namely PWLT, for image classification. Specifically, to capture multi-scale information, we employ windows of different sizes to encode objects at varying scales. To recover the relationships between different windows and exploit global context, we introduce a dual self-attention scheme that uses local-to-global attention to reestablish these relationships. Extensive experiments on the ImageNet-1K and CIFAR-100 datasets demonstrate the effectiveness of our PWLT for image classification.
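The two ideas the abstract names, window-restricted self-attention at several window sizes and a local-to-global step that reconnects the windows, can be sketched in plain Python. This is only an illustrative sketch, not the paper's implementation: all function names are our own, queries/keys/values are the raw token vectors (no learned projections or multiple heads), and the "global" step simply lets every token attend to mean-pooled window summaries as a stand-in for the paper's dual self-attention scheme.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def window_partition(h, w, ws):
    """Split an h x w grid of token positions into non-overlapping ws x ws windows."""
    windows = []
    for r0 in range(0, h, ws):
        for c0 in range(0, w, ws):
            windows.append([(r, c)
                            for r in range(r0, min(r0 + ws, h))
                            for c in range(c0, min(c0 + ws, w))])
    return windows

def window_attention(tokens, h, w, ws):
    """Dot-product self-attention restricted to each window.

    `tokens` maps (row, col) -> feature vector. For clarity, queries, keys,
    and values are the raw tokens (no learned projections, single head).
    """
    d = len(next(iter(tokens.values())))
    out = {}
    for win in window_partition(h, w, ws):
        for q in win:
            scores = [sum(a * b for a, b in zip(tokens[q], tokens[k])) / math.sqrt(d)
                      for k in win]
            weights = softmax(scores)
            out[q] = [sum(wt * tokens[k][i] for wt, k in zip(weights, win))
                      for i in range(d)]
    return out

def local_to_global_attention(tokens, h, w, ws):
    """Each token attends to mean-pooled summaries of all windows,
    restoring cross-window (global) interactions lost by local windows."""
    d = len(next(iter(tokens.values())))
    wins = window_partition(h, w, ws)
    summaries = [[sum(tokens[p][i] for p in win) / len(win) for i in range(d)]
                 for win in wins]
    out = {}
    for q in tokens:
        scores = [sum(a * b for a, b in zip(tokens[q], s)) / math.sqrt(d)
                  for s in summaries]
        weights = softmax(scores)
        out[q] = [sum(wt * s[i] for wt, s in zip(weights, summaries))
                  for i in range(d)]
    return out

def pyramid_window_attention(tokens, h, w, window_sizes=(2, 4)):
    """Average local window attention over several window sizes (the 'pyramid')."""
    outs = [window_attention(tokens, h, w, ws) for ws in window_sizes]
    return {p: [sum(o[p][i] for o in outs) / len(outs) for i in range(len(outs[0][p]))]
            for p in tokens}
```

For example, a 4x4 token grid with window sizes 2 and 4 mixes features first within 2x2 and 4x4 neighborhoods, after which `local_to_global_attention` lets each token see every window's summary; both outputs remain convex combinations of the input tokens.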

