Modeling Teacher-Student Techniques in Deep Neural Networks for Knowledge Distillation

Abstract

Knowledge distillation (KD) is a new method for transferring the knowledge of one model under training to another. In its conventional form, a small model (the student) learns from soft labels produced by a complex model (the teacher). Because of the novel idea behind KD, its notion has recently been adopted in a range of methods, including model compression and procedures aimed at improving model accuracy. Although many techniques have been proposed in the area of KD, a model that generalizes them is lacking. In this paper, various studies in the scope of KD are investigated and analyzed to build a general model for KD. All KD methods and techniques can be summarized through the proposed model, which represents them in an abstract view. With this model, different KD methods can be investigated and explored more thoroughly, the advantages and disadvantages of different approaches are easier to understand, and developing new KD strategies becomes possible.
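
The conventional teacher-student setup described above can be illustrated with a minimal sketch. It assumes the widely used temperature-softened formulation, in which the student is trained on a weighted sum of a soft-label term and a hard-label term. The function names, the temperature of 4.0, and the weight alpha are illustrative assumptions; they are not taken from the paper and do not represent its proposed general model.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives softer labels."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, predicted):
    """Cross-entropy between a target distribution and a predicted one."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)


def kd_loss(student_logits, teacher_logits, hard_label,
            temperature=4.0, alpha=0.5):
    """Weighted sum of a soft-label term (teacher) and a hard-label term.

    temperature and alpha are hypothetical defaults chosen for illustration.
    """
    soft_targets = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    # Scaling by temperature**2 keeps the two terms on a comparable scale.
    soft_term = cross_entropy(soft_targets, student_soft) * temperature ** 2
    one_hot = [1.0 if i == hard_label else 0.0
               for i in range(len(student_logits))]
    hard_term = cross_entropy(one_hot, softmax(student_logits))
    return alpha * soft_term + (1.0 - alpha) * hard_term


teacher = [5.0, 1.0, 0.0]
print(kd_loss([5.0, 1.0, 0.0], teacher, hard_label=0))  # student agrees
print(kd_loss([0.0, 1.0, 5.0], teacher, hard_label=0))  # student disagrees
```

A student whose logits agree with the teacher receives a lower loss than one that disagrees, which is the signal that drives the transfer.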

Similar Papers
  • Research Article
  • Cited by 3294
  • 10.1007/s11263-021-01453-z
Knowledge Distillation: A Survey
  • Mar 22, 2021
  • International Journal of Computer Vision
  • Jianping Gou + 3 more

In recent years, deep neural networks have been successful in both industry and academia, especially for computer vision tasks. The great success of deep learning is mainly due to its scalability to encode large-scale data and to maneuver billions of model parameters. However, it is a challenge to deploy these cumbersome deep models on devices with limited resources, e.g., mobile phones and embedded devices, not only because of the high computational complexity but also the large storage requirements. To this end, a variety of model compression and acceleration techniques have been developed. As a representative type of model compression and acceleration, knowledge distillation effectively learns a small student model from a large teacher model. It has received rapidly increasing attention from the community. This paper provides a comprehensive survey of knowledge distillation from the perspectives of knowledge categories, training schemes, teacher-student architecture, distillation algorithms, performance comparison and applications. Furthermore, challenges in knowledge distillation are briefly reviewed, and comments on future research are discussed.

  • Research Article
  • 10.1109/tpami.2025.3647862
Forget Me Not: Fighting Local Overfitting With Knowledge Fusion and Distillation.
  • Jan 1, 2025
  • IEEE Transactions on Pattern Analysis and Machine Intelligence
  • Uri Stern + 2 more

Overfitting in deep neural networks occurs less frequently than expected. This is a puzzling observation, as theory predicts that greater model capacity should eventually lead to overfitting - yet this is rarely seen in practice. But what if overfitting does occur, not globally, but in specific sub-regions of the data space? In this work, we introduce a novel score that measures the forgetting rate of deep models on validation data, capturing what we term local overfitting: a performance degradation confined to certain regions of the input space. We demonstrate that local overfitting can arise even without conventional overfitting, and is closely linked to the double descent phenomenon. Building on these insights, we introduce a two-stage approach that leverages the training history of a single model to recover and retain forgotten knowledge: first, by aggregating checkpoints into an ensemble, and then by distilling it into a single model of the original size, thus enhancing performance without added inference cost. Extensive experiments across multiple datasets, modern architectures, and training regimes validate the effectiveness of our approach. Notably, in the presence of label noise, our method - Knowledge Fusion followed by Knowledge Distillation - outperforms both the original model and independently trained ensembles, achieving a rare win-win scenario: reduced training and inference complexity.

  • Conference Article
  • Cited by 1
  • 10.1109/aiam48774.2019.00106
Obtain Dark Knowledge via Extended Knowledge Distillation
  • Oct 1, 2019
  • Chen Yuan + 1 more

Training a smaller student model on portable devices such as smartphones to mimic a complex and heavy teacher model through knowledge distillation has received extensive attention. Many studies have examined the application of knowledge distillation to various tasks, as well as improvements to the supervisory information used while training the student model. However, few studies focus on distilling "dark knowledge", which can be found in the teacher model but is hard to express directly. This knowledge is very important because the training data used to train the teacher model are not always visible to the student model. In this paper, we extended the method of knowledge distillation: we not only took the difference between the logits of the teacher model and the student model as part of the loss function, as the basic knowledge distillation method does, but also paid attention to the interior of both models. We divided both the teacher model and the student model into several segments and made the outputs of these segments as close as possible, which forms another part of the loss function; we refer to this method as "Extended-KD" (Extended Knowledge Distillation). In our experiment, we used the complete CIFAR-10 dataset to train the student model as a baseline, and then dropped all examples of some labels and trained the student model through Extended-KD. Our experiment shows that the Extended-KD method performs better than the basic knowledge distillation method, and that knowledge distillation with incomplete datasets can also enable the student model to predict target labels it has never seen. Therefore, the Extended-KD method can obtain dark knowledge properly.
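
The two-part loss described in this abstract (a logit term plus a term that pulls segment outputs together) can be sketched as follows. The abstract does not state the distance measure or the weighting, so mean squared error, equal-sized segment outputs, and the weight beta are assumptions made purely for illustration; this is not the authors' implementation.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def extended_kd_loss(student_logits, teacher_logits,
                     student_segments, teacher_segments, beta=1.0):
    """Logit-matching term plus a segment-matching term.

    student_segments / teacher_segments: one output vector per segment,
    assumed here to have matching sizes. beta is a hypothetical weight.
    """
    logit_term = mse(student_logits, teacher_logits)
    segment_term = sum(mse(s, t)
                       for s, t in zip(student_segments, teacher_segments))
    return logit_term + beta * segment_term


loss = extended_kd_loss(
    student_logits=[1.0, 2.0],
    teacher_logits=[1.0, 4.0],
    student_segments=[[0.0, 0.0]],
    teacher_segments=[[1.0, 1.0]],
    beta=0.5,
)
print(loss)
```

In practice, student and teacher segments often differ in width, so some projection would be needed before comparing them; that step is omitted here.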

  • Research Article
  • Cited by 24
  • 10.1016/j.ins.2021.08.020
Distilling from professors: Enhancing the knowledge distillation of teachers
  • Aug 11, 2021
  • Information Sciences
  • Duhyeon Bang + 2 more

  • Conference Article
  • Cited by 67
  • 10.18653/v1/2021.eacl-main.212
Annealing Knowledge Distillation
  • Jan 1, 2021
  • Aref Jafari + 3 more

Significant memory and computational requirements of large deep neural networks restrict their application on edge devices. Knowledge distillation (KD) is a prominent model compression technique for deep neural networks in which the knowledge of a trained large teacher model is transferred to a smaller student model. The success of knowledge distillation is mainly attributed to its training objective function, which exploits the soft-target information (also known as “dark knowledge”) besides the given regular hard labels in a training set. However, it is shown in the literature that the larger the gap between the teacher and the student networks, the more difficult their training with knowledge distillation becomes. To address this shortcoming, we propose an improved knowledge distillation method (called Annealing-KD) that feeds the rich information provided by the teacher’s soft-targets incrementally and more efficiently. Our Annealing-KD technique is based on a gradual transition over annealed soft-targets generated by the teacher at different temperatures in an iterative process; the student is therefore trained to follow the annealed teacher output in a step-by-step manner. This paper includes theoretical and empirical evidence as well as practical experiments to support the effectiveness of our Annealing-KD method. We ran a comprehensive set of experiments on different tasks, such as image classification (CIFAR-10 and CIFAR-100) and NLP language inference with BERT-based models on the GLUE benchmark, and consistently obtained superior results.
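
The idea of a gradual transition over annealed soft-targets can be illustrated with a small sketch. The abstract does not give the schedule or the loss, so the linear temperature schedule, the maximum temperature of 8.0, and the helper names below are assumptions; the sketch only shows how the teacher's targets sharpen as the temperature falls toward 1, not the Annealing-KD training procedure itself.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a softer distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def annealing_schedule(max_temperature, steps):
    """Linearly spaced temperatures from max_temperature down to 1.0.

    The linear spacing is a hypothetical choice made for this sketch.
    """
    if steps == 1:
        return [1.0]
    step = (max_temperature - 1.0) / (steps - 1)
    return [max_temperature - i * step for i in range(steps)]


teacher_logits = [6.0, 2.0, 1.0]
for temperature in annealing_schedule(8.0, 4):
    target = softmax(teacher_logits, temperature)
    # At each stage the student would be trained toward this target.
    print(round(temperature, 2), [round(p, 3) for p in target],
          round(entropy(target), 3))
```

Early stages present the student with a nearly flat, easy-to-match target, and later stages expose progressively more of the teacher's confident output.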

  • Research Article
  • Cited by 3
  • 10.1038/s41598-025-91152-3
Counterclockwise block-by-block knowledge distillation for neural network compression
  • Apr 3, 2025
  • Scientific Reports
  • Xiaowei Lan + 6 more

Model compression is a technique for transforming large neural network models into smaller ones. Knowledge distillation (KD) is a crucial model compression technique that involves transferring knowledge from a large teacher model to a lightweight student model. Existing knowledge distillation methods typically facilitate the knowledge transfer from teacher to student models in one or two stages. This paper introduces a novel approach called counterclockwise block-wise knowledge distillation (CBKD) to optimize the knowledge distillation process. The core idea of CBKD is to mitigate the generation gap between teacher and student models, facilitating the transmission of intermediate-layer knowledge from the teacher model. It divides both teacher and student models into multiple sub-network blocks, and in each stage of knowledge distillation, only the knowledge from one teacher sub-block is transferred to the corresponding position of a student sub-block. Additionally, in the CBKD process, deeper teacher sub-network blocks are assigned higher compression rates. Extensive experiments on Tiny-ImageNet-200 and CIFAR-10 demonstrate that the proposed CBKD method can enhance the distillation performance of various mainstream knowledge distillation approaches.

  • Research Article
  • Cited by 2
  • 10.3390/s24051612
Minimalist Deployment of Neural Network Equalizers in a Bandwidth-Limited Optical Wireless Communication System with Knowledge Distillation.
  • Mar 1, 2024
  • Sensors
  • Yiming Zhu + 4 more

An equalizer based on a recurrent neural network (RNN), especially with a bidirectional gated recurrent unit (biGRU) structure, is a good choice to deal with nonlinear damage and inter-symbol interference (ISI) in optical communication systems because of its excellent performance in processing time series information. However, its recursive structure prevents the parallelization of the computation, resulting in a low equalization rate. In order to improve the speed without compromising the equalization performance, we propose a minimalist 1D convolutional neural network (CNN) equalizer, which is reconverted from a biGRU with knowledge distillation (KD). In this work, we applied KD to regression problems and explained how KD helps students learn from teachers in solving regression problems. In addition, we compared the biGRU, the 1D-CNN after KD and the 1D-CNN without KD in terms of Q-factor and equalization velocity. The experimental data showed that the Q-factor of the 1D-CNN increased by 1 dB after KD learning from the biGRU, and KD increased the RoP sensitivity of the 1D-CNN by 0.89 dB at the HD-FEC threshold of 1 × 10⁻³. At the same time, compared with the biGRU, the proposed 1D-CNN equalizer reduced the computational time consumption by 97% and the number of trainable parameters by 99.3%, with only a 0.5 dB Q-factor penalty. The results demonstrate that the proposed minimalist 1D-CNN equalizer holds significant promise for future practical deployments in optical wireless communication systems.

  • Research Article
  • Cited by 2
  • 10.1177/15501477211057037
Feature fusion-based collaborative learning for knowledge distillation
  • Nov 1, 2021
  • International Journal of Distributed Sensor Networks
  • Yiting Li + 4 more

Deep neural networks have achieved great success in a variety of applications, such as self-driving cars and intelligent robotics. Meanwhile, knowledge distillation has received increasing attention as an effective model compression technique for training very efficient deep models. The performance of the student network obtained through knowledge distillation heavily depends on whether the transfer of the teacher’s knowledge can effectively guide the student training. However, most existing knowledge distillation schemes require a large teacher network pre-trained on large-scale data sets, which can increase the difficulty of knowledge distillation in different applications. In this article, we propose a feature fusion-based collaborative learning method for knowledge distillation. Specifically, during knowledge distillation, it enables networks to learn from each other using the feature/response-based knowledge in different network layers. We concatenate the features learned by the teacher and the student networks to obtain a more representative feature map for knowledge transfer. In addition, we introduce a network regularization method to further improve the model performance by providing positive knowledge during training. Experiments and ablation studies on two widely used data sets demonstrate that the proposed method, feature fusion-based collaborative learning, significantly outperforms recent state-of-the-art knowledge distillation methods.

  • Conference Article
  • Cited by 3
  • 10.1109/slt.2018.8639545
Efficient Building Strategy with Knowledge Distillation for Small-Footprint Acoustic Models
  • Dec 1, 2018
  • Takafumi Moriya + 7 more

In this paper, we propose a novel training strategy for deep neural network (DNN) based small-footprint acoustic models. The accuracy of DNN-based automatic speech recognition (ASR) systems can be greatly improved by leveraging large amounts of data to improve the level of expression. DNNs use many parameters to enhance recognition performance. Unfortunately, resource-constrained local devices are unable to run complex DNN-based ASR systems. For building compact acoustic models, the knowledge distillation (KD) approach is often used. KD uses a large, well-trained model that outputs target labels to train a compact model. However, standard KD cannot fully utilize the large model outputs to train compact models because the soft logits provide only rough information. We assume that the large model must give more useful hints to the compact model. We propose an advanced KD that uses mean squared error to minimize the discrepancies between the final hidden layer outputs. We evaluate our proposal on recorded speech data sets assuming car- and home-use scenarios, and show that our models achieve lower character error rates than the conventional KD approach or from-scratch training on computation resource-constrained devices.

  • Research Article
  • Cited by 45
  • 10.1007/s40747-020-00248-y
Knowledge from the original network: restore a better pruned network with knowledge distillation
  • Jan 10, 2021
  • Complex & Intelligent Systems
  • Liyang Chen + 3 more

To deploy deep neural networks to edge devices with limited computation and storage, model compression is necessary for the application of deep learning. Pruning, as a traditional way of model compression, seeks to reduce the number of model weights. However, when a deep neural network is pruned, the accuracy of the network decreases significantly. The traditional way to reduce the accuracy loss is fine-tuning. When too many parameters are pruned, the pruned network’s capacity is heavily reduced and it cannot recover to high accuracy. In this paper, we apply the knowledge distillation strategy to abate the accuracy loss of pruned models. The original network of the pruned network was used as the teacher network, aiming to transfer the dark knowledge from the original network to the pruned sub-network. We applied three mainstream knowledge distillation methods: response-based knowledge, feature-based knowledge, and relation-based knowledge (Gou et al. in Knowledge distillation: a survey. arXiv:2006.05525, 2020), and compared the results to the traditional fine-tuning method with ground-truth labels. Experiments were done on the CIFAR100 dataset with several deep convolutional neural networks. Results show that a pruned network recovered by knowledge distillation from its original network achieves better accuracy than one recovered by fine-tuning with sample labels. It has also been validated in this paper that the original network as the teacher performs better than differently structured networks with the same accuracy as the teacher.

  • Conference Article
  • 10.1109/nss/mic44867.2021.9875699
Knowledge Distillation: A Strategy to Enhance the Performance of Deep Learning-based Seminal Segmentation
  • Oct 16, 2021
  • Reza Karimzadeh + 3 more

Accurate segmentation of target tissues/structures as well as surrounding healthy organs/tissues (organs at risk (OARs)) plays a critical role in radiation therapy and treatment planning. Accurate segmentation of OARs prevents/minimizes unwanted toxicity to healthy tissues. Manual segmentation is time-consuming, tedious, prone to human errors and subject to intra- and inter-observer variability. In this regard, deep learning algorithms have shown extraordinary performance in automated organ segmentation from medical images, though the powerful/highly effective models might be computationally intensive. Transferring knowledge from complex/cumbersome models to simple/versatile models, known as knowledge distillation, has been proposed to address this issue (enhance the performance of the existing deep learning models). In this work, the impact of knowledge distillation on OARs segmentation from CT images is investigated for commonly used Unet and Resnet deep learning models. To this end, a highly complex Unet model (as teacher) and two conventional deep learning models (Unet and Resnet as students) were developed to delineate OARs on thoracic CT images from the SegTHOR public dataset. The models were trained once independently and once through knowledge distillation from the teacher model to the student models. The teacher model yielded segmentation accuracy in terms of Dice coefficient of 0.95 and 0.86 for the heart and aorta compared to the student models (Unet and Resnet) which achieved an accuracy of 0.79, 0.64 and 0.13, 0.22, respectively. After knowledge distillation from the teacher to the students, the accuracy of the Unet and Resnet improved to 0.91, 0.79 and 0.62, 0.63 for the heart and aorta, respectively. This study demonstrated the beneficial impact of knowledge distillation in enhancing the overall performance of conventional models without increasing their computational complexity.
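
The Dice coefficient used to report accuracy above measures the overlap between a predicted mask and a reference mask as twice the intersection divided by the sum of the two mask sizes. A minimal sketch for flattened binary masks follows; the function name and the convention of returning 1.0 for two empty masks are assumptions for illustration and are not taken from the study.

```python
def dice_coefficient(predicted_mask, reference_mask):
    """Dice overlap between two flattened binary masks of equal length."""
    pred = [bool(v) for v in predicted_mask]
    ref = [bool(v) for v in reference_mask]
    # Count voxels labeled as foreground in both masks.
    intersection = sum(p and r for p, r in zip(pred, ref))
    total = sum(pred) + sum(ref)
    if total == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    return 2.0 * intersection / total


print(dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0]))  # partial overlap
print(dice_coefficient([1, 1, 0, 0], [1, 1, 0, 0]))  # identical masks
```

A value of 1.0 indicates identical masks and 0.0 indicates no overlap, which gives a sense of scale for the reported scores.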

  • Conference Article
  • 10.1145/3665689.3665713
Research on Lightweight Spine X-ray Image Segmentation Algorithm Based on Knowledge Distillation
  • Jan 26, 2024
  • Dong Qing + 1 more

Spinal X-ray images find crucial roles in medical image processing, enabling precise spinal structure localization, detecting ischemia or lesions, and aiding surgical planning for clinical diagnosis and treatment. Rapid and accurate image segmentation, a vital aspect, saves significant time and effort for medical professionals. Knowledge distillation, a model compression technique, transfers insights from a complex teacher model to a simpler student model, reducing size and computational demands while maintaining performance. This paper proposes a lightweight image segmentation method using knowledge distillation. A teacher network integrates a channel attention mechanism for enhanced feature modeling, and residual connections improve feature propagation. The student network adopts a Mobile U-Net architecture for reduced complexity and computational costs. Through knowledge distillation, the student network inherits insights from the teacher, achieving efficient segmentation. Experimental results demonstrate its superiority over traditional U-Net and Mobile U-Net in spinal image segmentation, showcasing its potential for clinical applications.

  • Research Article
  • Cited by 3
  • 10.1016/j.displa.2023.102543
Trained teacher: Who is good at teaching
  • Sep 20, 2023
  • Displays
  • Xingzhu Liang + 5 more

  • Book Chapter
  • Cited by 10
  • 10.1016/b978-0-32-385787-1.00013-0
Chapter 8 - Knowledge distillation
  • Jan 1, 2022
  • Deep Learning for Robot Perception and Cognition
  • Nikolaos Passalis + 2 more

  • Conference Article
  • Cited by 550
  • 10.1109/cvpr42600.2020.00396
Revisiting Knowledge Distillation via Label Smoothing Regularization
  • Jun 1, 2020
  • Li Yuan + 4 more

Knowledge Distillation (KD) aims to distill the knowledge of a cumbersome teacher model into a lightweight student model. Its success is generally attributed to the privileged information on similarities among categories provided by the teacher model, and in this sense, only strong teacher models are deployed to teach weaker students in practice. In this work, we challenge this common belief with the following experimental observations: 1) beyond the acknowledgment that the teacher can improve the student, the student can also enhance the teacher significantly by reversing the KD procedure; 2) a poorly-trained teacher with much lower accuracy than the student can still improve the latter significantly. To explain these observations, we provide a theoretical analysis of the relationships between KD and label smoothing regularization. We prove that 1) KD is a type of learned label smoothing regularization and 2) label smoothing regularization provides a virtual teacher model for KD. From these results, we argue that the success of KD is not fully due to the similarity information between categories from teachers, but also to the regularization of soft targets, which is equally or even more important. Based on these analyses, we further propose a novel Teacher-free Knowledge Distillation (Tf-KD) framework, where a student model learns from itself or from a manually designed regularization distribution. Tf-KD achieves performance comparable to normal KD from a superior teacher, which makes it well suited to cases where a stronger teacher model is unavailable. Meanwhile, Tf-KD is generic and can be directly deployed for training deep neural networks. Without any extra computation cost, Tf-KD achieves up to 0.65% improvement on ImageNet over well-established baseline models, which is superior to label smoothing regularization.
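
The claim that label smoothing provides a virtual teacher can be made concrete with a small numerical check. The sketch assumes the standard label smoothing formulation, in which a one-hot label is mixed with a uniform distribution using a weight alpha; the helper names, the value alpha = 0.1, and the example prediction are illustrative assumptions. Because cross-entropy is linear in the target, the smoothed-label loss splits into a hard-label term plus a term against the fixed uniform distribution, which plays the role of a teacher. This is an illustration only and not the Tf-KD framework itself.

```python
import math


def cross_entropy(target, predicted):
    """Cross-entropy between a target distribution and a predicted one."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)


def one_hot(label, num_classes):
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def smoothed_label(label, num_classes, alpha):
    """Mix the one-hot label with a uniform distribution (weight alpha)."""
    uniform = 1.0 / num_classes
    return [(1.0 - alpha) * h + alpha * uniform
            for h in one_hot(label, num_classes)]


alpha = 0.1  # hypothetical smoothing weight
predicted = [0.7, 0.2, 0.1]  # an example student output distribution
uniform = [1.0 / 3.0] * 3  # the fixed "virtual teacher"

# Loss against the smoothed label.
smoothed_loss = cross_entropy(smoothed_label(0, 3, alpha), predicted)

# The same loss written as a hard-label term plus a "teacher" term.
split_loss = ((1.0 - alpha) * cross_entropy(one_hot(0, 3), predicted)
              + alpha * cross_entropy(uniform, predicted))

print(smoothed_loss, split_loss)  # the two values agree up to rounding
```

The split form has the same shape as a distillation objective with weight alpha on the teacher term, except that the teacher's distribution is fixed rather than learned.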
