Multi-perspective analysis on data augmentation in knowledge distillation

Wei Li, Shitong Shao, Ziming Qiu, Aiguo Song

doi:10.1016/j.neucom.2024.127516

Abstract

Knowledge distillation is an effective technique for transferring knowledge from a larger model to a smaller one, thereby notably enhancing the smaller model’s performance. Recently, data augmentation has been employed in contrastive-learning-based knowledge distillation techniques, yielding superior results. Despite its significant role, data augmentation remains underappreciated within the domain of knowledge distillation, and the literature thus far offers no in-depth analysis of it. To fill this gap, we conduct a multi-perspective theoretical and experimental analysis of the role that data augmentation can play in knowledge distillation. We summarize the properties of data augmentation and list the core findings as follows. (a) Our investigations validate that data augmentation significantly boosts the performance of knowledge distillation on image classification and object detection, even if the teacher model lacks comprehensive information about the augmented samples. Moreover, our novel Joint Data Augmentation (JDA) approach outperforms single data augmentation in knowledge distillation. (b) The pivotal role of data augmentation in knowledge distillation can be theoretically explained via Sharpness-Aware Minimization. (c) Data augmentation is compatible with various knowledge distillation methods and can enhance their performance. In light of these observations, we propose a new method called Cosine Confidence Distillation (CCD) for more reasonable knowledge transfer from augmented samples. Experimental results not only demonstrate that CCD achieves state-of-the-art performance with a lower storage requirement on CIFAR-100 and ImageNet-1k, but also validate the superiority of CCD over DIST on the object detection benchmark dataset MS-COCO.

Full Text