Abstract

The significant memory and computational requirements of large deep neural networks restrict their application on edge devices. Knowledge distillation (KD) is a prominent model compression technique for deep neural networks, in which the knowledge of a large trained teacher model is transferred to a smaller student model. The success of knowledge distillation is mainly attributed to its training objective function, which exploits the soft-target information (also known as “dark knowledge”) in addition to the regular hard labels in a training set. However, the literature shows that the larger the gap between the teacher and student networks, the harder it is to train the student with knowledge distillation. To address this shortcoming, we propose an improved knowledge distillation method (called Annealing-KD) that feeds the rich information in the teacher’s soft targets to the student incrementally and more efficiently. Our Annealing-KD technique is based on a gradual transition over annealed soft targets generated by the teacher at different temperatures in an iterative process; the student is therefore trained to follow the annealed teacher output in a step-by-step manner. This paper provides theoretical and empirical evidence as well as practical experiments to support the effectiveness of our Annealing-KD method. We conducted a comprehensive set of experiments on different tasks, including image classification (CIFAR-10 and CIFAR-100) and natural language inference with BERT-based models on the GLUE benchmark, and consistently obtained superior results.
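
As a rough illustration of the idea described above, the sketch below (PyTorch) trains the student to follow teacher logits scaled by a gradually increasing annealing factor, so the target sharpens step by step from a heavily softened signal to the raw teacher output. The MSE loss, the linear `k / max_temperature` schedule, and the single-phase loop are illustrative assumptions based on the abstract, not the paper's exact recipe; a subsequent fine-tuning stage on the hard labels would typically follow.

```python
import torch
import torch.nn as nn

def annealing_kd_sketch(student, teacher, loader, optimizer,
                        max_temperature=7, epochs_per_step=1, device="cpu"):
    """Sketch of the annealing phase: the student follows annealed teacher outputs.

    At annealing step k the teacher's logits are scaled by k / max_temperature,
    so the target gradually sharpens from a heavily softened signal (k = 1)
    to the raw teacher logits (k = max_temperature).
    """
    mse = nn.MSELoss()
    teacher.eval()
    for k in range(1, max_temperature + 1):       # step-by-step annealing schedule
        anneal = k / max_temperature              # scaling factor in (0, 1]
        for _ in range(epochs_per_step):
            for x, _ in loader:                   # hard labels are unused in this phase
                x = x.to(device)
                with torch.no_grad():
                    t_logits = teacher(x)         # teacher soft targets
                s_logits = student(x)
                loss = mse(s_logits, anneal * t_logits)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
```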

Highlights

  • We see that Annealing knowledge distillation (Annealing-KD) consistently outperforms the other techniques, both on the dev sets and on the General Language Understanding Evaluation (GLUE) leaderboard

  • When we reduce the size of the student to a 4-layer model (BERT-Small), the gap in average score over Vanilla KD is almost twice as large as with the DistilRoBERTa student (Table 3)

  • We discussed how the capacity gap between the teacher and student models may hamper the performance of knowledge distillation

Summary

Introduction

Despite the great success of deep neural networks in many challenging tasks such as natural language processing (Vaswani et al., 2017; Liu et al., 2019), computer vision (Wong et al., 2019; Howard et al., 2017), and speech processing (Chan et al., 2016; He et al., 2019), these state-of-the-art networks are usually too heavy to be deployed on edge devices with limited computational power (Bie et al., 2019; Lioutas et al., 2019). The over-parameterization and expensive computational complexity of deep networks can be addressed by neural model compression. There is an abundance of neural model compression techniques in the literature (Prato et al., 2019; Tjandra et al., 2018; Jacob et al., 2018), among which knowledge distillation (KD) is one of the most prominent (Hinton et al., 2015). Patient KD (Sun et al., 2019), TinyBERT (Jiao et al., 2019), and MobileBERT (Sun et al., 2020) are designed for distilling the knowledge of BERT-based teachers into smaller students.
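
For reference, the vanilla KD objective cited above (Hinton et al., 2015) mixes a soft-target term, computed from temperature-softened teacher and student distributions, with the usual cross-entropy on the hard labels. A minimal sketch, with illustrative (untuned) values for the temperature T and the mixing weight alpha:

```python
import torch
import torch.nn.functional as F

def vanilla_kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Hinton-style KD loss: soft-target KL term plus hard-label cross-entropy.

    T is the softmax temperature; alpha weights the distillation term.
    Both values here are illustrative, not tuned.
    """
    # KL divergence between temperature-softened distributions, scaled by T**2
    # to keep gradient magnitudes comparable (as in Hinton et al., 2015).
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```

Annealing-KD departs from this fixed-temperature objective by having the student follow a schedule of annealed teacher targets, as sketched earlier.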
