Abstract

In adapting the Transformer from language to computer vision, the major obstacles are the high computational complexity and large model size of Transformer blocks, which stem from the large number of visual tokens and the high resolution of the inputs. To address these challenges, this paper presents a mixture lightweight Transformer (MLT) backbone for image understanding, in which each Transformer block, called SH-Transformer, adopts Single-Head Self-Attention (SHSA) and a Convolutional Inception Module (CIM). Unlike previous Transformers that compute Multi-Head Self-Attention (MHSA), SHSA restricts the representation of the input tokens to a single head, yielding a low-dimensional embedding that greatly reduces computational complexity. Although it adds a small number of model parameters, SHSA greatly reduces the number of input tokens. As a complement to SHSA, which captures only global interactions, CIM explores multi-scale local information using lightweight convolutions arranged in parallel multi-path branches. Experimental results show that MLT achieves competitive or state-of-the-art results compared with recent Transformers while maintaining a smaller model size and lower computational cost across different visual tasks, including image classification, semantic segmentation, and object detection. In particular, the proposed method improves top-1 accuracy on ImageNet-1K image classification by 4.2% over the tiny version of the Pyramid Vision Transformer.
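
As a rough sketch of the two components described above, the code below shows one plausible way a single-head self-attention layer and a multi-path convolutional inception module could be written in PyTorch. The class names, head dimension, and kernel sizes are illustrative assumptions and do not reproduce the authors' implementation.

```python
import torch
import torch.nn as nn


class SingleHeadSelfAttention(nn.Module):
    """Illustrative single-head self-attention: queries, keys, and values
    share one low-dimensional head instead of being split across many heads."""

    def __init__(self, dim, head_dim=32):  # head_dim is an assumed value
        super().__init__()
        self.scale = head_dim ** -0.5
        self.qkv = nn.Linear(dim, head_dim * 3)  # single small projection
        self.proj = nn.Linear(head_dim, dim)

    def forward(self, x):                        # x: (B, N, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)   # each: (B, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        return self.proj(attn @ v)


class ConvInceptionModule(nn.Module):
    """Illustrative multi-path module: parallel depthwise convolutions with
    different kernel sizes gather multi-scale local context cheaply."""

    def __init__(self, dim, kernel_sizes=(3, 5, 7)):  # kernel sizes assumed
        super().__init__()
        self.paths = nn.ModuleList([
            nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim)  # depthwise
            for k in kernel_sizes
        ])
        self.fuse = nn.Conv2d(dim, dim, 1)       # pointwise fusion of paths

    def forward(self, x):                        # x: (B, dim, H, W)
        return self.fuse(sum(path(x) for path in self.paths))
```

In this sketch the attention branch models global token interactions with a single low-dimensional head, while the convolutional branch aggregates multi-scale local features in parallel, mirroring the complementary roles the abstract assigns to SHSA and CIM.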
