Abstract

Label Smoothing Regularization (LSR) is a widely used tool for improving the generalization of classification models by replacing the one-hot ground truth with smoothed labels. Recent research on LSR has increasingly focused on its correlation with Knowledge Distillation (KD), which transfers knowledge from a teacher model to a lightweight student model by penalizing the Kullback–Leibler (KL) divergence between their outputs. Based on this observation, a Teacher-free Knowledge Distillation (Tf-KD) method was proposed in previous work: instead of a real teacher model, a handcrafted distribution similar to LSR is used to guide the student's learning. Tf-KD is a promising substitute for LSR except for its hard-to-tune, model-dependent hyperparameters. This paper develops a new teacher-free framework, LSR-OS-TC, which decomposes the Tf-KD method into two components: model Output Smoothing (OS) and Teacher Correction (TC). First, LSR-OS extends the LSR method to the KD regime by applying a soft temperature to the model's output softmax layer; output smoothing is critical for stabilizing the KD hyperparameters across different models. Second, in the TC component, a larger proportion of probability mass is assigned to the right class of the uniform-distribution teacher to provide a more informative teacher. The two-component method was evaluated exhaustively on image (CIFAR-100, CIFAR-10, and CINIC-10) and audio (GTZAN) classification tasks. The results show that LSR-OS improves on LSR independently and with no extra computational cost, especially on several deep neural networks where LSR is ineffective. The further training boost from the TC component demonstrates the effectiveness of our two-component strategy. Overall, compared to the original Tf-KD method, LSR-OS-TC is a practical substitute for LSR that can be tuned on one model and applied directly to other models.
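
The abstract hinges on two label distributions: the smoothed LSR target and the TC-corrected uniform teacher. As a minimal illustration (not the paper's own code), the PyTorch sketch below builds both; the function names and the `correct_mass` default are hypothetical, and the actual TC proportion is a hyperparameter of the paper that is not restated here.

```python
import torch

def smoothed_labels(targets: torch.Tensor, num_classes: int, alpha: float = 0.1) -> torch.Tensor:
    """Classic LSR targets: keep most probability mass on the ground-truth class
    and spread the remaining alpha uniformly over all classes."""
    labels = torch.full((targets.size(0), num_classes), alpha / num_classes)
    labels.scatter_(1, targets.unsqueeze(1), 1.0 - alpha + alpha / num_classes)
    return labels

def tc_teacher(targets: torch.Tensor, num_classes: int, correct_mass: float = 0.5) -> torch.Tensor:
    """Teacher Correction (TC), sketched: start from a uniform 'virtual teacher' and
    move a larger proportion of probability mass onto the right class, making the
    handcrafted teacher more informative. correct_mass is an illustrative value."""
    teacher = torch.full((targets.size(0), num_classes), (1.0 - correct_mass) / (num_classes - 1))
    teacher.scatter_(1, targets.unsqueeze(1), correct_mass)
    return teacher
```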

Highlights

  • Deep learning has seen a boom of successes; yet, as networks become deeper and wider, models consume more and more computational resources [1,2,3]. There is a trend toward light models with fewer parameters to save memory and accelerate training and inference [4,5,6,7]

  • Reference [13] showed that a soft temperature larger than one is critical for the effectiveness of Knowledge Distillation (KD), so we extended the KL loss to a generalized form and put forward the Label Smoothing Regularization (LSR)-Output Smoothing (OS) component (a minimal code sketch follows after this list): $L_{\mathrm{LSR\text{-}OS}}(p, q) = (1 - \alpha) \times L_{CE}(p(1), q) + \alpha \times \tau^{2} \times L_{KL}(u \,\|\, p(\tau))$

  • Teacher-free KD (Tf-KD) needs several runs to find the best hyperparameters for each specific model, whereas our LSR-OS-Teacher Correction (TC) method works consistently on different models with the same hyperparameters
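
A minimal PyTorch sketch of the LSR-OS loss from the second highlight, assuming `logits` are the raw student outputs and `targets` are integer class labels; the default `alpha` and `tau` values are illustrative, not the paper's tuned settings.

```python
import torch
import torch.nn.functional as F

def lsr_os_loss(logits: torch.Tensor, targets: torch.Tensor,
                alpha: float = 0.1, tau: float = 20.0) -> torch.Tensor:
    """L_{LSR-OS}(p, q) = (1 - alpha) * L_CE(p(1), q) + alpha * tau^2 * L_KL(u || p(tau))."""
    num_classes = logits.size(1)
    # Cross-entropy on the temperature-1 softmax p(1) against the hard labels q.
    ce = F.cross_entropy(logits, targets)
    # KL(u || p(tau)): uniform virtual teacher u vs. the temperature-softened student p(tau).
    log_p_tau = F.log_softmax(logits / tau, dim=1)
    u = torch.full_like(log_p_tau, 1.0 / num_classes)
    kl = F.kl_div(log_p_tau, u, reduction="batchmean")
    # The tau^2 factor keeps gradient magnitudes comparable across temperatures.
    return (1.0 - alpha) * ce + alpha * tau ** 2 * kl
```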



Introduction

Deep learning has been a story of booming successes; yet, as networks become deeper and wider, models consume more and more computational resources [1,2,3]. There is a trend toward light models with fewer parameters to save memory and accelerate training and inference [4,5,6,7]. Knowledge Distillation (KD) is a popular way to obtain such light models: besides the traditional classification cross-entropy loss (Figure 1b), a Kullback–Leibler (KL) divergence loss between a pre-trained teacher and the student model is penalized during student training. By minimizing this KL-divergence loss, the student model learns to mimic the inter-class relationships in the teacher's predictions. In addition to its many applications in model compression, KD is used to boost network training with multiple models of identical architecture [14,15] or with single-model self-distillation [16,17,18].
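
For concreteness, the standard KD objective described above can be sketched in PyTorch as below; this is a generic Hinton-style formulation rather than the paper's implementation, and `alpha` and `tau` are placeholder values. The teacher logits are assumed to come from a pre-trained teacher held fixed during student training.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor,
            targets: torch.Tensor, alpha: float = 0.5, tau: float = 4.0) -> torch.Tensor:
    """Cross-entropy on the hard labels plus a KL term that pulls the student's
    softened predictions toward the (frozen) teacher's softened predictions."""
    ce = F.cross_entropy(student_logits, targets)
    log_p_student = F.log_softmax(student_logits / tau, dim=1)
    p_teacher = F.softmax(teacher_logits.detach() / tau, dim=1)
    kl = F.kl_div(log_p_student, p_teacher, reduction="batchmean")
    return (1.0 - alpha) * ce + alpha * tau ** 2 * kl
```

In practice the teacher runs in eval mode with gradients disabled, so only the student's parameters are updated by this loss.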
