Abstract

Deep neural networks have achieved remarkable results in the last several years. However, breakthroughs in accuracy are typically accompanied by explosive growth in computation and parameters, which severely limits model deployment. In this paper, we propose a novel knowledge distillation technique named self-distillation to address this problem. Self-distillation attaches several attention modules and shallow classifiers at different depths of a neural network and distills knowledge from the deepest classifier to the shallower classifiers. Unlike conventional knowledge distillation methods, where the knowledge of a teacher model is transferred to a separate student model, self-distillation can be regarded as knowledge transfer within the same model, from the deeper layers to the shallower layers. Moreover, the additional classifiers allow the network to run in a dynamic manner, which leads to much higher acceleration. Experiments demonstrate that self-distillation is consistently and significantly effective across various neural networks and datasets: on average, accuracy boosts of 3.49 and 2.32 percent are observed on CIFAR100 and ImageNet, respectively. In addition, experiments show that self-distillation can be combined with other model compression methods, including knowledge distillation, pruning and lightweight model design.
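To make the training objective concrete, the following is a minimal sketch (not the paper's reference implementation) of a self-distillation-style loss in PyTorch. It assumes the network returns one logit tensor per classifier, ordered from the shallowest to the deepest exit; the attention modules mentioned above are omitted for brevity, and the temperature `T` and weight `alpha` are illustrative hyperparameters, not values taken from the paper.

```python
import torch
import torch.nn.functional as F

def self_distillation_loss(exit_logits, labels, T=3.0, alpha=0.5):
    """Cross-entropy on every exit plus distillation from the deepest exit.

    exit_logits: list of [batch, num_classes] tensors, ordered shallow -> deep.
    T, alpha: illustrative temperature / weighting, not the paper's values.
    """
    deepest = exit_logits[-1]

    # Hard-label supervision for every classifier, including the deepest one.
    ce = sum(F.cross_entropy(logits, labels) for logits in exit_logits)

    # Soft-label supervision: shallow classifiers mimic the deepest classifier.
    teacher_prob = F.softmax(deepest.detach() / T, dim=1)
    kd = sum(
        F.kl_div(F.log_softmax(logits / T, dim=1), teacher_prob,
                 reduction="batchmean") * (T * T)
        for logits in exit_logits[:-1]
    )
    return (1 - alpha) * ce + alpha * kd
```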

Highlights

  • We have proposed a novel distillation method named self-distillation, which brings benefits in model accuracy, model acceleration and model compression simultaneously

  • The original neural network is modified into a multi-exit neural network by introducing additional shallow classifiers at different depths, as sketched after this list
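As a rough sketch of this multi-exit construction (the class, names and classifier heads here are hypothetical, not taken from the paper), a backbone can be split into stages with a small classifier head attached after each stage; at inference time the shallow exits allow a prediction to be returned early once it is confident enough:

```python
import torch
import torch.nn as nn

class MultiExitNet(nn.Module):
    """Backbone split into stages, with a shallow classifier after each stage."""

    def __init__(self, stages, num_classes):
        super().__init__()
        self.stages = nn.ModuleList(stages)   # e.g. the residual stages of a ResNet
        self.heads = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                          nn.LazyLinear(num_classes))
            for _ in stages
        ])

    def forward(self, x):
        logits = []
        for stage, head in zip(self.stages, self.heads):
            x = stage(x)
            logits.append(head(x))
        return logits                          # ordered shallow -> deepest

@torch.no_grad()
def early_exit_predict(model, x, threshold=0.9):
    """Dynamic inference: stop at the first exit whose confidence passes the threshold."""
    feats = x
    for stage, head in zip(model.stages, model.heads):
        feats = stage(feats)
        probs = torch.softmax(head(feats), dim=1)
        conf, pred = probs.max(dim=1)
        if conf.item() >= threshold:           # batch size 1 assumed here
            return pred
    return pred
```

During training, the list of logits returned by `forward` can be fed into a loss such as the one sketched after the abstract; at inference time, the confidence threshold that controls early exiting trades accuracy against speed.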


Introduction

Deep convolutional neural networks have shown promising results in many applications such as image classification [1], [2], [3], [4], object detection [5], [6], [7], [8] and segmentation [9], [10], [11]. To achieve good performance, modern convolutional neural networks always require a tremendous amount of computation and storage, which has severely limited their deployment on resource-limited devices and in real-time applications. In recent years, this problem has been extensively explored and numerous model compression and acceleration methods have been proposed. Knowledge distillation is one of the most effective approaches: it first trains an over-parameterized neural network as a teacher and then trains a small student network to mimic the output of the teacher network. Researchers have found that the choice of teacher model has a great impact on the accuracy of the student model, and that the teacher with the highest accuracy is not necessarily the best teacher for distillation [32].
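For context, this conventional teacher-student distillation is commonly implemented as a cross-entropy term on the ground-truth labels plus a Kullback-Leibler divergence between the softened output distributions of the two models. The sketch below follows that standard formulation, with an illustrative temperature `T` and weight `alpha` that are not taken from the paper:

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Standard teacher-student distillation: hard-label CE plus a softened KL term.

    T and alpha are illustrative hyperparameters, not values from the paper.
    """
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits.detach() / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    return (1 - alpha) * hard + alpha * soft
```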
