Abstract
Knowledge distillation is one of the most compelling approaches to model compression: it transfers representational expertise from a large deep-learning teacher model to a small student network. Although numerous techniques have been proposed to improve teacher representations at the logit level, no study has examined the weaknesses of the teacher's representations at the feature level during distillation. At the same time, in a trained deep-learning model, not all kernels are uniformly activated when making a specific prediction. Transferring this knowledge directly may cause the student to learn a suboptimal intrinsic distribution and may prevent existing distillation methods from reaching their full potential. Motivated by these issues, this study analyses the generalization capability of teachers with and without a uniformly activated channel distribution. Preliminary investigations and theoretical analyses show that partly uniforming, or smoothing, the feature maps yields improved representations that enrich generalization capability. Based on these observations, it is hypothesized that distillation-based explicit supervision using smoothed feature maps together with a cross-entropy loss plays a significant role in improving generalization. Hence, this paper proposes a novel technique called Partly Unified Recalibrated Feature (PURF) map distillation. The proposed method recalibrates the feature maps by exchanging representational cues among nearest-neighbor channels. PURF improves the performance of state-of-the-art knowledge distillation (KD) methods across architectures in terms of generalization, model compression, few-shot training, transferability, and robustness transfer on standard benchmark datasets. PURF achieves a 1.51% average accuracy improvement on seven diverse architectures in image classification, and it raises the accuracy of state-of-the-art KD methods by an average of 1.91% across architectures. Moreover, PURF achieves 2.02% and 0.96% higher accuracy on average in transferability and robustness tasks, respectively, on standard benchmark datasets.
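The abstract does not specify how the partly smoothed feature maps are computed or combined with the cross-entropy loss. The following is a minimal, hypothetical PyTorch sketch of one plausible reading: "intercommunicating representational cues among nearest-neighbor channels" is interpreted here as a local average along the channel axis, partially blended with the original teacher features, and used as an additional feature-distillation target alongside the student's cross-entropy loss. The function names, the blending factor alpha, the loss weight beta, and the assumption that student and teacher feature maps share the same shape are all illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def smooth_channels(feat, kernel_size=3, alpha=0.5):
    """Partly smooth a feature map by blending each channel with the mean
    of its nearest-neighbor channels (hypothetical reading of PURF).

    feat: teacher feature map of shape (B, C, H, W).
    alpha: fraction of the channel-neighborhood average mixed into each channel.
    """
    b, c, h, w = feat.shape
    # Treat the channel dimension as a 1-D sequence per spatial location.
    x = feat.permute(0, 2, 3, 1).reshape(b * h * w, 1, c)          # (B*H*W, 1, C)
    smoothed = F.avg_pool1d(x, kernel_size, stride=1,
                            padding=kernel_size // 2,
                            count_include_pad=False)
    smoothed = smoothed.reshape(b, h, w, c).permute(0, 3, 1, 2)    # back to (B, C, H, W)
    # "Partly" smoothing: keep (1 - alpha) of the original response.
    return (1 - alpha) * feat + alpha * smoothed


def distillation_loss(student_feat, teacher_feat, student_logits, labels, beta=1.0):
    """Cross-entropy on the student's predictions plus an MSE term that
    supervises the student with the partly smoothed teacher feature map.
    Assumes student_feat and teacher_feat have matching shapes."""
    ce = F.cross_entropy(student_logits, labels)
    feat_term = F.mse_loss(student_feat, smooth_channels(teacher_feat).detach())
    return ce + beta * feat_term
```

In this sketch the smoothed teacher map is detached so gradients flow only into the student, mirroring the standard teacher-student setup; a projection layer would be needed if the two networks' channel counts differ.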