Abstract

Knowledge distillation (KD) is a successful technique for transferring knowledge from one machine learning model to another. The idea of KD has been widely used for tasks such as model compression and knowledge transfer between different models. However, existing studies on KD have overlooked the possibility that the dark knowledge (i.e., soft targets) obtained from a large, complex model (the teacher model) may be incorrect or insufficient; such knowledge can hinder the effective learning of a smaller model (the student model). In this paper, we propose the professor model, which refines the soft targets from the teacher model to improve KD. The professor model aims to achieve two goals: 1) improving the prediction accuracy and 2) capturing the inter-class correlations of the teacher’s soft targets. We first design the professor model by reformulating a conditional adversarial autoencoder (CAAE). We then devise two KD strategies that use both the teacher and professor models. Our empirical study demonstrates that the professor model effectively improves KD on three benchmark datasets: CIFAR100, TinyImagenet, and ILSVRC2015. Moreover, our comprehensive analysis shows that the professor model is far more effective than employing a stronger teacher model whose parameter count exceeds the combined parameters of the teacher and professor models. Since the professor model is model-agnostic, it can be combined with any KD algorithm and consistently improves various KD techniques.
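The abstract builds on distillation from temperature-scaled soft targets (dark knowledge). The sketch below shows the standard soft-target KD loss (Hinton et al., 2015), assuming PyTorch; the trailing professor-refinement step is a hypothetical illustration of where refined soft targets would enter the loss, not the paper's implementation.

```python
# Minimal sketch of soft-target knowledge distillation (Hinton et al., 2015),
# assuming PyTorch. The professor-refinement usage at the end is hypothetical.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    # Teacher's temperature-scaled soft targets ("dark knowledge").
    soft_targets = F.softmax(teacher_logits / T, dim=1)
    log_student = F.log_softmax(student_logits / T, dim=1)
    # KL term is scaled by T^2 so gradient magnitudes stay comparable across temperatures.
    kl = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T * T)
    # Standard cross-entropy on the hard labels.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kl + (1.0 - alpha) * ce

# Hypothetical usage with a professor model that refines the teacher's
# soft targets before distillation (interface assumed, not from the paper):
# refined_logits = professor(teacher_logits, labels)
# loss = kd_loss(student_logits, refined_logits, labels)
```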
