Abstract

Transformers, with their long-range dependency modeling and data specificity, are an effective means of classifying insect pests in agricultural engineering. Although many methods have been proposed to confine the range of self-attention within a local region to reduce computational complexity, none of them reduces the number of model parameters. Moreover, the self-attention mechanism usually causes query tokens to focus excessively on nearby image patches, which limits the effective receptive field and the long-range dependency. To address these issues, this paper establishes a novel Dilated-Windows-based Vision Transformer with Efficient-Suppressive-self-attention (DWViT-ES) architecture, which includes efficient-self-attention (ESA), dilated windows (DW), and suppressive-self-attention (SSA) as its core components. The ESA simplifies the successive linear transformations to reduce the number of model parameters and the computational cost. Meanwhile, the DW and SSA expand the effective receptive field of the self-attention mechanism to prevent query tokens from focusing on similar and close regions, thereby preventing the loss of useful information. Experiments show that the DWViT-ES has only 19.6 M parameters and 3.5 G FLOPs (over 20% reductions vs. the 28.3 M and 4.5 G of Swin-T). Trained from scratch, the DWViT-ES reaches 71.6% top-1 accuracy on the IP102 dataset (a 2.4% absolute improvement over Swin-T); after ImageNet-1K pre-training, it achieves 76.0% and 78.7% top-1 accuracy on IP102 and CPB (0.1% and 0.9% absolute improvements over Swin-T), respectively. Finally, a practical deployment on a mobile embedded device is presented, which validates the feasibility of the DWViT-ES.
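As a rough illustration of the dilated-window idea (a sketch only; the abstract does not give the DW module's exact formulation, so the function name and parameters below are assumptions), a dilated window keeps the token count of an ordinary local attention window while sampling those tokens on a strided grid, so the spatial span grows without extra attention cost:

```python
import numpy as np

def dilated_window_indices(H, W, window=3, dilation=2, top=0, left=0):
    """Return the (row, col) token coordinates covered by one dilated window.

    A plain local window of size `window` covers a window x window patch of
    the H x W token grid. With `dilation` > 1, the same number of tokens is
    sampled on a strided grid, so the spatial span widens to
    window + (window - 1) * (dilation - 1) while the attention cost
    (number of attended tokens) stays window**2.
    Illustrative sketch only -- not the paper's implementation.
    """
    rows = top + dilation * np.arange(window)
    cols = left + dilation * np.arange(window)
    assert rows[-1] < H and cols[-1] < W, "window must fit inside the feature map"
    return [(r, c) for r in rows for c in cols]

# A 3x3 window with dilation 2 still attends to 9 tokens,
# but spans a 5x5 region of the 8x8 token grid.
idx = dilated_window_indices(8, 8, window=3, dilation=2)
span = max(r for r, _ in idx) - min(r for r, _ in idx) + 1  # 5
```

This is why dilation enlarges the effective receptive field at no additional self-attention cost: the query still compares against 9 keys, but those keys are spread over a wider region of the image.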
