Abstract

Vision Transformers (ViTs) have achieved unprecedented success in vision tasks with the assistance of abundant data. However, the lack of inductive bias in lightweight ViTs makes learning locality challenging on small datasets, leading to poor performance. This limitation impedes the application of lightweight ViTs in scenarios with limited data and computational power. Knowledge Distillation (KD) allows a student model to benefit from a teacher model; however, traditional single-stage KD methods deliver fixed knowledge throughout training and are therefore suboptimal for a student model whose capacity grows as learning progresses. To address these issues, we propose, for the first time, a simple yet effective two-stage KD method called Curriculum Information Knowledge Distillation (CIKD). Specifically, we incorporate an easy-to-difficult curriculum learning framework into the KD process. In the first stage, Attention Locality Imitation (ALI), the student model learns locality from the teacher model's low-level semantic features through self-attention distillation. In the second stage, Logit Mimicking (LM), the student model learns label information and high-level semantic logits from the teacher model. Without bells and whistles, our approach achieves state-of-the-art results on 8 small-scale datasets with ViT-Tiny (5.0M parameters). Our code and model weights are available at: https://github.com/newLLing/CIKD.
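To make the two stages concrete, the sketch below shows one plausible form of the ALI and LM losses described above. It is not the authors' released implementation (see their repository for that); the MSE attention-matching loss, the temperature-scaled KL divergence, and the hyperparameters T and alpha are illustrative assumptions.

    # Minimal sketch of two-stage CIKD-style losses; hypothetical, not the official code.
    import torch
    import torch.nn.functional as F

    def attention_locality_imitation(student_attn, teacher_attn):
        """Stage 1 (ALI): match the student's self-attention maps to the
        teacher's low-level attention maps. Assumes both tensors have shape
        (batch, heads, tokens, tokens) and come from comparable early blocks."""
        return F.mse_loss(student_attn, teacher_attn)

    def logit_mimicking(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
        """Stage 2 (LM): standard logit KD plus cross-entropy on ground-truth
        labels. T and alpha are assumed values, not taken from the paper."""
        kd = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        ce = F.cross_entropy(student_logits, labels)
        return alpha * kd + (1.0 - alpha) * ce

In this reading, training would first optimize attention_locality_imitation so the lightweight student acquires locality, then switch to logit_mimicking once its representations have matured; how the stages are scheduled or weighted is specified in the paper, not here.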
