Transformer-based networks have revolutionized visual tasks with their continuous innovation, leading to significant progress. However, the widespread adoption of Vision Transformers (ViT) is limited due to their high computational and parameter requirements, making them less feasible for resource-constrained mobile and edge computing devices. Moreover, existing lightweight ViTs exhibit limitations in capturing different granular features, extracting local features efficiently, and incorporating the inductive bias inherent in convolutional neural networks. These limitations somewhat impact the overall performance. To address these limitations, we propose an efficient ViT called Dual-Granularity Former (DGFormer). DGFormer mitigates these limitations by introducing two innovative modules: Dual-Granularity Attention (DG Attention) and Efficient Feed-Forward Network (Efficient FFN). In our experiments, on the image recognition task of ImageNet, DGFormer surpasses lightweight models such as PVTv2-B0 and Swin Transformer by 2.3% in terms of Top1 accuracy. On the object detection task of COCO, under RetinaNet detection framework, DGFormer outperforms PVTv2-B0 and Swin Transformer with increase of 0.5% and 2.4% in average precision (AP), respectively. Similarly, under Mask R-CNN detection framework, DGFormer exhibits improvement of 0.4% and 1.8% in AP compared to PVTv2-B0 and Swin Transformer, respectively. On the semantic segmentation task on the ADE20K, DGFormer achieves a substantial improvement of 2.0% and 2.5% in mean Intersection over Union (mIoU) over PVTv2-B0 and Swin Transformer, respectively. The code is open-source and available at: https://github.com/ISCLab-Bistu/DGFormer.git.
Read full abstract