In knowledge distillation (KD), a lightweight student model attains improved test accuracy by mimicking the behavior of a large pre-trained model (the teacher). However, the cumbersome teacher model often produces over-confident predictions, which generalize poorly on unseen data, and a student trained by such a teacher inherits this problem. To mitigate this issue, we present a new KD framework, dubbed coded knowledge distillation (CKD), in which the student instead mimics the behavior of a coded teacher. Compared with the teacher in KD, the coded teacher in CKD prepends an adaptive encoding layer that encodes an input image into a compressed version (e.g., via JPEG encoding) and then feeds the compressed image to the pre-trained teacher. Comprehensive experimental results demonstrate the effectiveness of CKD over KD. In addition, we extend the coded teacher to other knowledge transfer methods and show that it improves their test accuracy as well.
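As a rough illustration of the data flow described above, the sketch below distills a student against a teacher whose inputs are first passed through a JPEG round-trip. The fixed quality factor, the loss weighting, and all function names are illustrative assumptions standing in for the paper's adaptive encoding layer, not its actual implementation.

```python
# Minimal CKD-style training step (sketch, assuming a PyTorch setup).
# The fixed JPEG quality below is a simplification; the paper's encoding
# layer chooses the compression adaptively per input.
import torch
import torch.nn.functional as F
from torchvision.io import encode_jpeg, decode_jpeg


def jpeg_compress(images: torch.Tensor, quality: int = 50) -> torch.Tensor:
    """Round-trip a batch of [0, 1] float CHW images through JPEG at the given quality."""
    out = []
    for img in images:
        u8 = (img.clamp(0, 1) * 255).to(torch.uint8)
        out.append(decode_jpeg(encode_jpeg(u8, quality=quality)).float() / 255.0)
    return torch.stack(out)


def distill_step(student, teacher, images, labels, quality=50, T=4.0, alpha=0.9):
    """One step where the student mimics the teacher's outputs on JPEG-coded inputs."""
    with torch.no_grad():
        # Coded teacher: pre-trained teacher applied to the compressed images.
        coded_teacher_logits = teacher(jpeg_compress(images, quality))
    # The student still sees the original, uncompressed images.
    student_logits = student(images)
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(coded_teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    ce_loss = F.cross_entropy(student_logits, labels)
    return alpha * kd_loss + (1 - alpha) * ce_loss
```

The only change relative to a standard KD step is that the teacher's logits are computed on the compressed inputs, which is intended to soften the over-confident responses discussed above.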