Abstract

Vision transformers (ViTs) have become one of the dominant frameworks for vision tasks in recent years because self-attention lets them efficiently capture long-range dependencies in image recognition. CNNs and ViTs each have advantages and disadvantages in vision tasks, and several studies suggest that combining the two can effectively balance performance and computational cost. In this paper, we propose a new hybrid network based on CNNs and transformers, using the CNN to extract local features and the transformer to capture long-range dependencies. We also propose a new feature-map down-sampling method based on the Discrete Cosine Transform and self-attention, named DCT-Attention Down-sample (DAD). Our DctViT-L achieves 84.8% top-1 accuracy on ImageNet-1K, outperforming CMT, Next-ViT, SpectFormer and other state-of-the-art models at lower computational cost. With DctViT-B as the backbone, RetinaNet achieves 46.8% mAP on COCO val2017, improving mAP by 2.5% and 1.1% over the CMT-S and SpectFormer backbones, respectively, with less computation.
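The DAD module combines the Discrete Cosine Transform with self-attention; the abstract does not give its details, but the DCT half of the idea can be illustrated with a minimal sketch: transform each channel to the frequency domain, keep only the low-frequency block, and invert, which halves the spatial resolution. The function name `dct_downsample` and the truncation scheme are assumptions for illustration, not the paper's actual module.

```python
import numpy as np
from scipy.fft import dctn, idctn

def dct_downsample(feat, factor=2):
    """Hypothetical sketch: reduce spatial resolution by keeping only the
    low-frequency DCT coefficients of each channel.

    feat: array of shape (C, H, W); returns (C, H//factor, W//factor).
    """
    C, H, W = feat.shape
    # 2D DCT over the spatial axes, per channel
    coeffs = dctn(feat, axes=(1, 2), norm="ortho")
    # Keep the top-left low-frequency block (most of the signal energy)
    low = coeffs[:, : H // factor, : W // factor]
    # Inverse DCT returns a smaller spatial-domain feature map
    return idctn(low, axes=(1, 2), norm="ortho")
```

In the paper's DAD module this frequency-domain reduction is paired with self-attention; the sketch above covers only the DCT-based resolution reduction.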
