Obtain Dark Knowledge via Extended Knowledge Distillation
Training a smaller student model for portable devices such as smartphones to mimic a complex, heavy teacher model through knowledge distillation has received extensive attention. Many studies have applied knowledge distillation to various tasks or improved the supervisory information used to train the student model. However, few studies focus on distilling "dark knowledge", which is present in the teacher model but hard to express directly. Such knowledge is very important because the training data used to train the teacher model are not always visible to the student model. In this paper we extend knowledge distillation: in addition to taking the difference between the teacher's and the student's logits as one part of the loss function, as basic knowledge distillation does, we also pay attention to the interior of both models. We divide both the teacher model and the student model into several segments and make the outputs of corresponding segments as close as possible, which forms another part of the loss function. We refer to this method as "Extended-KD" (Extended Knowledge Distillation). In our experiments, we used the complete CIFAR-10 dataset to train the student model as a baseline, and then dropped all examples of some labels and trained the student model through Extended-KD. Our experiments show that Extended-KD performs better than basic knowledge distillation, and that knowledge distillation with incomplete datasets can still enable the student model to predict target labels it has never seen. Therefore, Extended-KD can obtain dark knowledge properly.
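The two-part loss described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say which distance is used for either part, so the temperature-softened KL divergence for the logit term, the mean squared error for the segment term, the weight `beta`, and all function names are assumptions. The sketch also assumes that corresponding teacher and student segment outputs already have the same length.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature (assumed softening step)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def mse(a, b):
    """Mean squared error between two equally sized output vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def extended_kd_loss(teacher_logits, student_logits,
                     teacher_segments, student_segments,
                     temperature=4.0, beta=1.0):
    """Hypothetical Extended-KD style loss: logit term plus segment term.

    teacher_segments / student_segments are lists of per-segment outputs,
    one flat vector per segment, paired by position.
    """
    # Part 1: difference between teacher and student logits (assumed: KL on soft labels).
    logit_term = kl_divergence(softmax(teacher_logits, temperature),
                               softmax(student_logits, temperature))
    # Part 2: closeness of corresponding segment outputs (assumed: averaged MSE).
    segment_term = sum(mse(t, s) for t, s in zip(teacher_segments, student_segments))
    segment_term /= len(teacher_segments)
    return logit_term + beta * segment_term
```

In a real network the segment outputs would be intermediate feature maps, and a projection layer may be needed when the teacher and student shapes differ; the abstract does not specify how that case is handled.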
- Research Article
122
- 10.1016/j.media.2022.102693
- Feb 1, 2023
- Medical Image Analysis
SSD-KD: A self-supervised diverse knowledge distillation method for lightweight skin lesion classification using dermoscopic images.
- Research Article
6
- 10.1016/j.dsp.2024.104512
- Apr 17, 2024
- Digital Signal Processing
Discretization and decoupled knowledge distillation for arbitrary oriented object detection
- Research Article
5
- 10.1016/j.asoc.2024.111579
- Apr 9, 2024
- Applied Soft Computing
PURF: Improving teacher representations by imposing smoothness constraints for knowledge distillation
- Conference Article
4
- 10.1145/3589334.3645440
- May 13, 2024
Unsupervised semantic hashing has emerged as an indispensable technique for fast image search, which aims to convert images into binary hash codes without relying on labels. Recent advancements in the field demonstrate that employing large-scale backbones (e.g., ViT) in unsupervised semantic hashing models can yield substantial improvements. However, the inference delay has become increasingly difficult to overlook. Knowledge distillation provides a means for practical model compression to alleviate this delay. Nevertheless, the prevailing knowledge distillation approaches are not explicitly designed for semantic hashing. They ignore the unique search paradigm of semantic hashing, the inherent necessities of the distillation process, and the property of hash codes. In this paper, we propose an innovative Bit-mask Robust Contrastive knowledge Distillation (BRCD) method, specifically devised for the distillation of semantic hashing models. To ensure the effectiveness of two kinds of search paradigms in the context of semantic hashing, BRCD first aligns the semantic spaces between the teacher and student models through a contrastive knowledge distillation objective. Additionally, to eliminate noisy augmentations and ensure robust optimization, a cluster-based method within the knowledge distillation process is introduced. Furthermore, through a bit-level analysis, we uncover the presence of redundancy bits resulting from the bit independence property. To mitigate these effects, we introduce a bit mask mechanism in our knowledge distillation objective. Finally, extensive experiments not only showcase the noteworthy performance of our BRCD method in comparison to other knowledge distillation methods but also substantiate the generality of our methods across diverse semantic hashing models and backbones. The code for BRCD is available at https://github.com/hly1998/BRCD.
- Research Article
8
- 10.1016/j.csl.2023.101583
- Nov 9, 2023
- Computer Speech & Language
Dual Knowledge Distillation for neural machine translation
- Preprint Article
1
- 10.20944/preprints202503.0903.v1
- Mar 13, 2025
- Preprints.org
As deep learning models are widely applied across various domains, a critical challenge is how to compress models while maintaining high reasoning capability. Knowledge distillation, an effective technique for model compression, has been used to enhance the performance of lightweight models. However, traditional distillation methods are limited when dealing with complex reasoning tasks. Reinforcement learning (RL) offers a novel approach to knowledge distillation by optimizing the reasoning strategies of teacher models, generating more efficient decision paths, and providing more valuable learning content for student models. This paper reviews the latest advancements in combining reinforcement learning with knowledge distillation, focusing on policy distillation, value function distillation, and dynamic reward-guided distillation methods. It also discusses the challenges faced by RL-driven distillation, such as simplifying complex strategies, addressing temporal dependencies, and balancing exploration and exploitation, and suggests possible solutions. Finally, this paper explores the applications of RL-driven knowledge distillation in fields such as game AI, robotic control, and dialogue systems, and outlines future research directions, including automated distillation, multimodal distillation, and challenges in federated learning.
- Preprint Article
1
- 10.20944/preprints202503.0903.v2
- Mar 27, 2025
- Preprints.org
As deep learning models are widely applied across various domains, a critical challenge is how to compress models while maintaining high reasoning capability. Knowledge distillation, an effective technique for model compression, has been used to enhance the performance of lightweight models. However, traditional distillation methods are limited when dealing with complex reasoning tasks. Reinforcement learning (RL) offers a novel approach to knowledge distillation by optimizing the reasoning strategies of teacher models, generating more efficient decision paths, and providing more valuable learning content for student models. This paper reviews the latest advancements in combining reinforcement learning with knowledge distillation, focusing on policy distillation, value function distillation, and dynamic reward-guided distillation methods. It also discusses the challenges faced by RL-driven distillation, such as simplifying complex strategies, addressing temporal dependencies, and balancing exploration and exploitation, and suggests possible solutions. Finally, this paper explores the applications of RL-driven knowledge distillation in fields such as game AI, robotic control, and dialogue systems, and outlines future research directions, including automated distillation, multimodal distillation, and challenges in federated learning.
- Conference Article
42
- 10.1109/mvip49855.2020.9116923
- Feb 1, 2020
Knowledge distillation (KD) is a new method for transferring the knowledge of a structure under training to another one. The conventional application of KD is training a small model (the student) on soft labels produced by a complex model (the teacher). Owing to the novelty of the idea, the notion of KD has recently been used in other methods, such as compression and procedures intended to enhance model accuracy. Although different techniques have been proposed in the area of KD, a model that generalizes KD techniques is lacking. In this paper, various studies in the scope of KD are investigated and analyzed to build a general model for KD. All the methods and techniques in KD can be summarized through the proposed model. By utilizing the proposed model, different methods in KD can be better investigated and explored, the advantages and disadvantages of different approaches can be better understood, and developing new KD strategies becomes possible. Using the proposed model, different KD methods are represented in an abstract view.
- Book Chapter
6
- 10.1007/978-3-030-86520-7_16
- Jan 1, 2021
Knowledge distillation (KD) is one of the most efficient methods to compress a large deep neural network (called the teacher) into a smaller network (called the student). Current state-of-the-art KD methods assume that the training data of the teacher and the student follow identical distributions, so that the student's accuracy stays close to the teacher's accuracy. However, this strong assumption is not met in many real-world applications, where a distribution mismatch occurs between the teacher's training data and the student's training data. As a result, existing KD methods often fail in this case. To overcome this problem, we propose a novel method for the KD process that remains effective when a distribution mismatch occurs. We first learn a distribution based on the student's training data, from which we can sample images well classified by the teacher. By doing this, we can discover the data space where the teacher has good knowledge to transfer to the student. We then propose a new loss function to train the student network, which achieves better accuracy than the standard KD loss function. We conduct extensive experiments to demonstrate that our method works well for KD tasks with or without distribution mismatch. To the best of our knowledge, our method is the first to address the challenge of distribution mismatch in the KD process.
- Conference Article
1
- 10.1145/3442555.3442581
- Nov 27, 2020
Knowledge distillation is dedicated to improving the performance of lightweight networks by transferring knowledge during the training process. It is also important to apply knowledge distillation in different situations. The previous knowledge distillation method with adversarial samples uses a traditional knowledge distillation loss to let the student learn a good decision boundary. In this paper, we propose a novel method named Adversarial Metric Knowledge Distillation (AMKD), which utilizes adversarial samples to transfer the dark knowledge from the teacher to the student. We select adversarial samples that are close to the decision boundary between two classes and measure their distance to negative-class samples under a triplet loss constraint. The method ensures that the student network learns the relationships among samples through quantitative metric learning. Therefore, we not only transfer information about the decision boundary but also ensure that the student network always maintains a proper distance from other negative classes. This can be another good exploration for knowledge distillation with adversarial samples. Experiments on the CIFAR-10, CIFAR-100 and Tiny ImageNet datasets verify that the proposed knowledge distillation method effectively improves the student network's performance.
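The triplet loss constraint mentioned in this abstract can be illustrated with the standard margin-based triplet loss. This is a generic sketch, not AMKD itself: the abstract does not state which samples play the anchor, positive, and negative roles, nor the distance or margin used, so the Euclidean distance, the `margin` value, and the function names below are assumptions.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: the anchor should be closer to the positive
    than to the negative by at least `margin`; otherwise a penalty applies."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)
```

Under one possible reading, an adversarial sample near the decision boundary would serve as the anchor and samples of other classes as negatives, so that minimizing the loss keeps the student's embeddings a margin away from the negative classes.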
- Research Article
2
- 10.3390/app14083284
- Apr 13, 2024
- Applied Sciences
Knowledge distillation based on the features from the penultimate layer allows the student (lightweight model) to efficiently mimic the internal feature outputs of the teacher (high-capacity model). However, the training data may not conform to the ground-truth distribution of images in terms of classes and features. We propose two knowledge distillation algorithms to solve this problem, one by fitting the ground-truth distribution of classes and the other by fitting the ground-truth distribution of features. The former uses teacher labels instead of dataset labels to supervise the student's classification output, while the latter introduces feature temperature parameters to correct the teacher's abnormal feature distribution output. We conducted knowledge distillation experiments on the ImageNet-2012 and CIFAR-100 datasets using seven sets of homogeneous models and six sets of heterogeneous models. The experimental results show that our proposed algorithms improve the performance of penultimate-layer feature knowledge distillation and outperform other existing knowledge distillation methods in terms of classification performance and generalization ability.
- Research Article
12
- 10.1016/j.neucom.2024.127516
- Mar 5, 2024
- Neurocomputing
Multi-perspective analysis on data augmentation in knowledge distillation
- Conference Article
2
- 10.1145/3701716.3717645
- May 8, 2025
Federated Learning (FL) has gained significant popularity in privacy-preserving distributed learning, wherein data remains on edge devices, ensuring data security and user privacy. Despite its advantages, FL faces considerable challenges, including the non-independent and identically distributed (non-IID) nature of data across clients, degraded performance compared to centralized learning methods, and communication inefficiency due to frequent exchanges of large model updates between clients and the server. To overcome these challenges, FL has been combined with knowledge distillation (KD), yielding a decentralized learning methodology that facilitates collaborative model training across several devices or clients while preserving data privacy. Traditional KD generally focuses on transferring knowledge through logits; however, this overlooks the importance of intermediate feature representations within the model. To address this limitation, we propose incorporating Shapley Additive Explanations (SHAP), or Shapley values, into KD methods. Shapley values quantify feature importance, enabling the transfer of critical feature contributions and thereby enhancing the effectiveness of KD. In this work, we propose a novel decentralized machine learning approach named FedKDShap, a refined federated learning method with Shapley-value-informed KD, which prioritizes performance, interpretability, resource efficiency, and feature importance within the distillation process to optimize knowledge transfer from a high-capacity teacher model to a lightweight student model. By embedding Shapley values into the KD loss function, this integration not only reduces communication demands on resource-constrained devices but also enhances model convergence in non-IID data settings. Our experiments use benchmark datasets to simulate real-world non-IID data distributions and demonstrate that FedKDShap enhances model accuracy and outperforms state-of-the-art architectures. Our code repository is available at: https://github.com/shadhin39/FedKDShap
- Research Article
47
- 10.1007/s11263-023-01792-z
- Apr 25, 2023
- International Journal of Computer Vision
Knowledge distillation is a simple yet effective technique for deep model compression, which aims to transfer the knowledge learned by a large teacher model to a small student model. To mimic how the teacher teaches the student, existing knowledge distillation methods mainly adopt a unidirectional knowledge transfer, where the knowledge extracted from different intermediate layers of the teacher model is used to guide the student model. However, in real-world education, students learn more effectively through multi-stage learning with self-reflection, which current knowledge distillation methods ignore. Inspired by this, we devise a new knowledge distillation framework entitled multi-target knowledge distillation via student self-reflection, or MTKD-SSR, which can not only enhance the teacher's ability to unfold the knowledge to be distilled, but also improve the student's capacity to digest that knowledge. Specifically, the proposed framework consists of three target knowledge distillation mechanisms: a stage-wise channel distillation (SCD), a stage-wise response distillation (SRD), and a cross-stage review distillation (CRD). SCD and SRD transfer feature-based knowledge (i.e., channel features) and response-based knowledge (i.e., logits) at different stages, respectively, and CRD encourages the student model to conduct self-reflective learning after each stage by a self-distillation of the response-based knowledge. Experimental results on five popular visual recognition datasets, CIFAR-100, Market-1501, CUB200-2011, ImageNet, and Pascal VOC, demonstrate that the proposed framework significantly outperforms recent state-of-the-art knowledge distillation methods.
- Research Article
4
- 10.3390/a15050160
- May 11, 2022
- Algorithms
Large-scale automatic speech recognition (ASR) models have achieved impressive performance. However, huge computational resources and massive amounts of data are required to train an ASR model. Knowledge distillation is a prevalent model compression method that transfers knowledge from a large model to a small model. To improve the efficiency of knowledge distillation for end-to-end speech recognition, especially in the low-resource setting, a Mixup-based Knowledge Distillation (MKD) method is proposed, which combines Mixup, a data-agnostic data augmentation method, with softmax-level knowledge distillation. A loss-level mixture is presented to address the problem caused by the non-linearity of the label in the KL divergence when adapting Mixup to the teacher–student framework. It is shown mathematically that optimizing the mixture of loss functions is equivalent to optimizing an upper bound of the original knowledge distillation loss. The proposed MKD takes advantage of Mixup and brings robustness to the model even with a small amount of training data. Experiments on Aishell-1 show that MKD obtains 15.6% and 3.3% relative improvements on two student models with different parameter scales compared with existing methods. Experiments on data efficiency demonstrate that MKD achieves similar results with only half of the original dataset.
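The upper-bound claim in this abstract can be checked numerically under one plausible reading, which is an assumption rather than the paper's stated formulation: the "original" loss is the KL divergence between a Mixup-interpolated teacher label and the student output, while the loss-level mixture interpolates the two per-sample KL losses instead. Because KL divergence is convex in its first argument, the loss-level mixture is never smaller. All function names below are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; q is assumed strictly positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mix(a, b, lam):
    """Mixup-style convex combination lam * a + (1 - lam) * b."""
    return [lam * x + (1 - lam) * y for x, y in zip(a, b)]

def label_level_loss(teacher_i, teacher_j, student_out, lam):
    """KL between the mixed teacher label and the student output (assumed original loss)."""
    return kl_divergence(mix(teacher_i, teacher_j, lam), student_out)

def loss_level_mixture(teacher_i, teacher_j, student_out, lam):
    """Mixture of the two per-sample KL losses with the same Mixup weight."""
    return (lam * kl_divergence(teacher_i, student_out)
            + (1 - lam) * kl_divergence(teacher_j, student_out))
```

Here `teacher_i` and `teacher_j` stand for the teacher's soft labels on the two samples being mixed, and `student_out` for the student's softmax output on the mixed input. For any valid inputs, `loss_level_mixture(...)` is at least `label_level_loss(...)`, which matches the sense in which minimizing the former minimizes an upper bound of the latter.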