Knowledge distillation (KD) and model averaging (MA) are prominent techniques for enhancing the efficiency and effectiveness of deep neural networks (DNNs). MA produces an averaged model by averaging multiple checkpoints along the stochastic gradient descent (SGD) trajectory, and the resulting model tends to lie on the flatter side of the local minimum (Fig. 2). However, MA operates offline and does not influence which local minimum the network converges to. By contrast, KD transfers knowledge from a teacher model to a student model, guiding the student toward a better local minimum. Even so, KD can still yield a poor convergence location within that minimum if the teacher over- or under-regularizes the student's behavior. By combining KD and MA, we aim to leverage KD to reach a favorable local minimum and MA to secure a robust convergence location within it. However, our results revealed that the two techniques were incompatible when combined naively. This study empirically analyzed the causes of this incompatibility and proposed an annealing knowledge distillation (AKD) scheme to address it. Building on MA and AKD, we introduced a general self-distillation framework referred to as self-distillation with model averaging (SDMA). To the best of our knowledge, SDMA is the first general distillation framework that aims to achieve both a favorable local minimum and a robust convergence location within it. Extensive experiments demonstrated that SDMA significantly improved the generalization performance of DNNs.
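The abstract does not spell out the AKD schedule or the SDMA training procedure, but a minimal training-loop sketch can illustrate the two ingredients it combines: a distillation loss whose weight is annealed over training and a uniform average of SGD checkpoints. The linear annealing schedule, the use of the averaged model as the self-distillation teacher, and the names and hyperparameters (`train_sdma_like`, `kd_loss`, `T`, `alpha`) are illustrative assumptions for this sketch, not details taken from the paper.

```python
# A minimal sketch, not the paper's exact algorithm: an annealed distillation
# term combined with uniform checkpoint averaging (via PyTorch's swa_utils).
# The linear annealing schedule, the choice of the averaged model as teacher,
# and all hyperparameters below are illustrative assumptions.
import torch
import torch.nn.functional as F
from torch.optim.swa_utils import AveragedModel


def kd_loss(student_logits, teacher_logits, T=4.0):
    """Softened-softmax distillation loss (Hinton-style KD)."""
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)


def train_sdma_like(student, loader, epochs=100, lr=0.1, device="cpu"):
    student = student.to(device)
    avg_model = AveragedModel(student)  # running uniform average of checkpoints
    optimizer = torch.optim.SGD(student.parameters(), lr=lr, momentum=0.9)

    for epoch in range(epochs):
        # "Annealing": the distillation weight decays linearly to zero (assumed schedule).
        alpha = 1.0 - epoch / epochs
        student.train()
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            with torch.no_grad():
                teacher_logits = avg_model(x)  # averaged model as teacher (assumption)
            logits = student(x)
            loss = F.cross_entropy(logits, y) + alpha * kd_loss(logits, teacher_logits)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        avg_model.update_parameters(student)  # fold this epoch's checkpoint into the average

    # For models with BatchNorm, running statistics should be recomputed for avg_model,
    # e.g. with torch.optim.swa_utils.update_bn(loader, avg_model).
    return avg_model
```

In this sketch the averaged model, rather than the final SGD iterate, is the network that would be deployed, mirroring the abstract's goal of pairing a favorable local minimum (via the distillation term) with a robust convergence location within it (via averaging).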