Abstract

Vision transformers have been applied successfully to image recognition tasks because of their ability to capture long-range dependencies within an image. However, they still suffer from weak local feature extraction, easy loss of channel-interaction information when multi-head self-attention is modeled along a single dimension, and a large parameter count. This paper proposes a lightweight hybrid architecture for image classification named EPSViTs (Efficient Parameter Shared Transformer). First, a new local feature extraction module is designed to effectively enhance the expression of local features. Second, a lightweight multi-head self-attention module based on information interaction is designed using parameter sharing; it models the image globally along both the spatial and channel dimensions and mines the latent correlations of the image in space and channel. Extensive experiments are conducted on three public datasets (a subset of ImageNet, CIFAR-100, and APTOS2019) and a private dataset, Mushroom66. The results show that the proposed parameter-sharing hybrid architecture EPSViTs has clear advantages: on the ImageNet subset it reaches 89.18% accuracy, a 3.8% improvement over EdgeViTs-XXS, verifying the effectiveness of the model.
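
The abstract does not give implementation details, but a minimal sketch can illustrate the general idea of sharing one set of attention parameters across a spatial pass and a channel pass. The code below is an assumption-laden illustration, not the authors' actual EPSViTs module: the class name SharedSpatialChannelAttention, the choice to reuse the same QKV/output projections for both passes, and the additive fusion of the two passes are all hypothetical.

```python
import torch
import torch.nn as nn


class SharedSpatialChannelAttention(nn.Module):
    """Hypothetical sketch of parameter-shared attention: a single set of
    QKV/output projections serves both a spatial pass (tokens attend over
    positions) and a channel pass (feature channels attend over each other),
    so the channel branch adds no extra parameters. Not the paper's design."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        # Shared projections, reused by both attention passes.
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)

    def _attend(self, q, k, v):
        # Standard scaled dot-product attention over the last two axes.
        attn = (q @ k.transpose(-2, -1)) * self.scale
        return attn.softmax(dim=-1) @ v

    def forward(self, x):  # x: (B, N, C) token sequence
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each (B, heads, N, head_dim)

        # Spatial pass: the attention map is (N x N), mixing positions.
        spatial = self._attend(q, k, v)

        # Channel pass: transpose so the attention map is
        # (head_dim x head_dim), mixing channels -- same weights, reused.
        qc, kc, vc = (t.transpose(-2, -1) for t in (q, k, v))
        channel = self._attend(qc, kc, vc).transpose(-2, -1)

        out = (spatial + channel).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


# Example: 14x14 = 196 tokens with 64 channels.
x = torch.randn(2, 196, 64)
module = SharedSpatialChannelAttention(dim=64, num_heads=4)
print(module(x).shape)  # torch.Size([2, 196, 64])
```

Because both passes go through the same nn.Linear layers, the channel branch costs only extra computation, not extra parameters, which matches the abstract's stated goal of global modeling in both dimensions without growing the parameter count.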
