BookKD: A novel knowledge distillation for reducing distillation costs by decoupling knowledge generation and learning
- Conference Article
8
- 10.24963/ijcai.2021/444
- Aug 1, 2021
Knowledge distillation (KD) has recently emerged as an efficacious scheme for learning compact deep neural networks (DNNs). Despite the promising results achieved, the rationale behind the behavior of KD has remained largely understudied. In this paper, we introduce a novel task-oriented attention model, termed KDExplainer, to shed light on the working mechanism underlying vanilla KD. At the heart of KDExplainer is a Hierarchical Mixture of Experts (HME), in which a multi-class classification is reformulated as a multi-task binary one. Through distilling knowledge from a free-form pre-trained DNN to KDExplainer, we observe that KD implicitly modulates the knowledge conflicts between different subtasks, and in reality has much more to offer than label smoothing. Based on these findings, we further introduce a portable tool, dubbed the virtual attention module (VAM), that can be seamlessly integrated with various DNNs to enhance their performance under KD. Experimental results demonstrate that, at negligible additional cost, student models equipped with VAM consistently outperform their non-VAM counterparts across different benchmarks. Furthermore, when combined with other KD methods, VAM remains competent in promoting results, even though it is only motivated by vanilla KD. The code is available at https://github.com/zju-vipa/KDExplainer.
- Conference Article
550
- 10.1109/cvpr42600.2020.00396
- Jun 1, 2020
Knowledge Distillation (KD) aims to distill the knowledge of a cumbersome teacher model into a lightweight student model. Its success is generally attributed to the privileged information on similarities among categories provided by the teacher model, and in this sense, only strong teacher models are deployed to teach weaker students in practice. In this work, we challenge this common belief with the following experimental observations: 1) beyond the acknowledgment that the teacher can improve the student, the student can also enhance the teacher significantly by reversing the KD procedure; 2) a poorly-trained teacher with much lower accuracy than the student can still improve the latter significantly. To explain these observations, we provide a theoretical analysis of the relationships between KD and label smoothing regularization. We prove that 1) KD is a type of learned label smoothing regularization and 2) label smoothing regularization provides a virtual teacher model for KD. From these results, we argue that the success of KD is not fully due to the similarity information between categories from teachers, but also to the regularization of soft targets, which is equally or even more important. Based on these analyses, we further propose a novel Teacher-free Knowledge Distillation (Tf-KD) framework, where a student model learns from itself or from a manually designed regularization distribution. Tf-KD achieves performance comparable to normal KD from a superior teacher, which makes it well suited to cases where a stronger teacher model is unavailable. Meanwhile, Tf-KD is generic and can be directly deployed for training deep neural networks. Without any extra computation cost, Tf-KD achieves up to 0.65% improvement on ImageNet over well-established baseline models, which is superior to label smoothing regularization.
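The stated link between label smoothing and a "virtual teacher" can be illustrated with a standard identity: cross-entropy against a label-smoothed target equals a weighted mix of the hard-label cross-entropy and the cross-entropy against a uniform distribution, which then plays the role of a teacher with uniform outputs. The sketch below checks this numerically. It is a textbook decomposition and not necessarily the exact formulation used in the paper; the logits, the smoothing value `eps`, and all function names are assumptions chosen for illustration.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, probs):
    # H(target, probs) = -sum_k target_k * log(probs_k)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

def smoothed_target(label, num_classes, eps):
    # Label smoothing: (1 - eps) * one_hot + eps * uniform
    uniform = 1.0 / num_classes
    return [(1 - eps) * (1.0 if k == label else 0.0) + eps * uniform
            for k in range(num_classes)]

# Illustrative values (assumptions, not taken from the paper).
logits = [2.0, 0.5, -1.0]
label, eps = 0, 0.1

probs = softmax(logits)
one_hot = [1.0, 0.0, 0.0]
uniform_teacher = [1.0 / 3] * 3  # the "virtual teacher" distribution

# Loss computed directly against the smoothed label.
ls_loss = cross_entropy(smoothed_target(label, 3, eps), probs)

# Same loss written as a hard-label term plus a uniform-teacher term.
mix_loss = ((1 - eps) * cross_entropy(one_hot, probs)
            + eps * cross_entropy(uniform_teacher, probs))

print(ls_loss, mix_loss)
```

The two printed values agree up to floating-point error, because cross-entropy is linear in the target distribution.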
- Research Article
36
- 10.3390/s20164616
- Aug 17, 2020
- Sensors (Basel, Switzerland)
In this paper, we propose an efficient knowledge distillation method to train light networks using heavy networks for semantic segmentation. Most semantic segmentation networks that exhibit good accuracy are based on computationally expensive networks. These networks are not suitable for mobile applications using vision sensors, because computational resources are limited in these environments. In this view, knowledge distillation, which transfers knowledge from heavy networks acting as teachers to light networks acting as students, is a suitable methodology. Although previous knowledge distillation approaches have been proven to improve the performance of student networks, most methods have some limitations. First, they tend to use only the spatial correlation of feature maps and ignore the relational information of their channels. Second, they can transfer false knowledge when the results of the teacher networks are not perfect. To address these two problems, we propose two loss functions: a channel and spatial correlation (CSC) loss function and an adaptive cross entropy (ACE) loss function. The former computes the full relationship of both the channel and spatial information in the feature map, and the latter adaptively exploits one-hot encodings using the ground truth labels and the probability maps predicted by the teacher network. To evaluate our method, we conduct experiments on two scene parsing datasets: Cityscapes and CamVid. Our method presents significantly better performance than previous methods.
- Conference Article
3
- 10.1145/3394171.3413985
- Oct 12, 2020
Learning from music to visual storytelling of shots is an interesting and emerging task. It produces a coherent visual story in the form of a shot type sequence, which not only expands the storytelling potential for a song but also facilitates automatic concert video mashup process and storyboard generation. In this study, we present a deep interactive learning (DIL) mechanism for building a compact yet accurate sequence-to-sequence model to accomplish the task. Different from the one-way transfer between a pre-trained teacher network (or ensemble network) and a student network in knowledge distillation (KD), the proposed method enables collaborative learning between an ensemble teacher network and a student network. Namely, the student network also teaches. Specifically, our method first learns a teacher network that is composed of several assistant networks to generate a shot type sequence and produce the soft target (shot types) distribution accordingly through KD. It then constructs the student network that learns from both the ground truth label (hard target) and the soft target distribution to alleviate the difficulty of optimization and improve generalization capability. As the student network gradually advances, it turns to feed back knowledge to the assistant networks, thereby improving the teacher network in each iteration. Owing to such interactive designs, the DIL mechanism bridges the gap between the teacher and student networks and produces more superior capability for both networks. Objective and subjective experimental results demonstrate that both the teacher and student networks can generate more attractive shot sequences from music, thereby enhancing the viewing and listening experience.
- Conference Article
4
- 10.18653/v1/2022.emnlp-main.664
- Jan 1, 2022
Overconfidence has been shown to impair generalization and calibration of a neural network. Previous studies remedy this issue by adding a regularization term to a loss function, preventing a model from making a peaked distribution. Label smoothing smoothes target labels with a pre-defined prior label distribution; as a result, a model is learned to maximize the likelihood of predicting the soft label. Nonetheless, the amount of smoothing is the same in all samples and remains fixed in training. In other words, label smoothing does not reflect the change in probability distribution mapped by a model over the course of training. To address this issue, we propose a regularization scheme that brings dynamic nature into the smoothing parameter by taking model probability distribution into account, thereby varying the parameter per instance. A model in training self-regulates the extent of smoothing on the fly during forward propagation. Furthermore, inspired by recent work in bridging label smoothing and knowledge distillation, our work utilizes self-knowledge as a prior label distribution in softening target labels, and presents theoretical support for the regularization effect by knowledge distillation and the dynamic smoothing parameter. Our regularizer is validated comprehensively, and the result illustrates marked improvements in model generalization and calibration, enhancing robustness and trustworthiness of a model.
- Conference Article
- 10.1109/icipcn67432.2026.11438428
- Jan 27, 2026
Reliable, noninvasive staging of liver fibrosis (F0–F4) is crucial for guiding therapy while minimizing biopsies. This study presents a compact, deployable ultrasound pipeline, to our knowledge the first to apply knowledge distillation (KD) to liver fibrosis staging, that distills a high-capacity teacher (ResNet-50) into lightweight students (MobileNetV3-Small, ResNet-18, EfficientNet-B0) suitable for bedside inference. Credible estimation is enforced via a leak-safe split that groups images by file-stem to prevent near-duplicates from straddling train/validation partitions. Training couples class-weighted cross-entropy, label smoothing, MixUp, and modest geometric/photometric augmentation with AdamW, ReduceLROnPlateau, and early stopping; KD uses temperature T = 3 and mixing α = 0.6. Primary reporting excludes test-time augmentation (No-TTA) and emphasizes accuracy, balanced accuracy, and macro-F1 to align selection with deployment. On a 6,323-image dataset, the teacher attains 98.74% accuracy, 98.12% balanced accuracy, and 0.9812 macro-F1. The best student, KD(ResNet-50→EfficientNet-B0), achieves 98.19%/97.41%/0.9733 with only 4.01M parameters and ≈1.87 ms per image, while KD(ResNet-50→MobileNetV3-Small) reaches 98.03%/97.07%/0.9705 at 1.52M parameters and ≈0.42 ms. Distillation therefore preserves high discrimination while substantially reducing size and latency. Error analyses show uniformly high per-class recall with residual confusions concentrated between adjacent stages, supporting clinical plausibility. Overall, the contributions are a simple yet rigorous KD framework tailored to ultrasound staging, a leak-aware validation protocol that curbs optimistic bias, and deployment-ready students that provide a strong basis for comparative studies and prospective multi-center validation toward trustworthy point-of-care decision support.
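As a rough illustration of how a temperature T = 3 and a mixing weight α = 0.6 typically enter a KD objective, the sketch below implements the common temperature-scaled formulation: a KL term between softened teacher and student distributions, scaled by T², mixed with the hard-label cross-entropy. The abstract does not state which term α weights or whether the T² factor is used, so both choices here are assumptions, as are the function names and example logits. Class weighting, label smoothing, and MixUp are omitted.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled, numerically stable softmax.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, label, temperature=3.0, alpha=0.6):
    # Hard term: cross-entropy of the student (at T = 1) against the true label.
    hard = -math.log(softmax(student_logits)[label])

    # Soft term: KL(teacher || student) on temperature-softened distributions.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = sum(t * math.log(t / s)
               for t, s in zip(p_teacher, p_student) if t > 0)

    # Assumption: alpha weights the soft term, and the soft term is scaled
    # by T^2 to keep gradient magnitudes comparable across temperatures.
    return alpha * temperature ** 2 * soft + (1 - alpha) * hard

# Hypothetical logits for a five-class problem (stages F0-F4), true class 2.
teacher = [0.2, 1.0, 4.0, 1.5, -0.5]
student = [0.0, 0.8, 2.5, 1.9, -0.2]
print(kd_loss(student, teacher, label=2))
```

When the student's logits match the teacher's, the soft term vanishes and only the weighted hard-label term remains.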
- Research Article
- 10.29109/gujsc.1141648
- Sep 30, 2022
- Gazi Üniversitesi Fen Bilimleri Dergisi Part C: Tasarım ve Teknoloji
Deploying convolutional neural networks to mobile or embedded devices is often prohibited by limited memory and computational resources. This is particularly problematic for the most successful networks, which tend to be very large and require long inference times. Many alternative approaches have been developed for compressing neural networks based on pruning, regularization, quantization or distillation. In this paper, we propose the “Knowledge Distillation with Dynamic Pruning” (KDDP), which trains a dynamically pruned compact student network under the guidance of a large teacher network. In KDDP, we train the student network with supervision from the teacher network, while applying L1 regularization on the neuron activations in a fully-connected layer. Subsequently, we prune inactive neurons. Our method automatically determines the final size of the student model. We evaluate the compression rate and accuracy of the resulting networks on an image classification dataset, and compare them to results obtained by Knowledge Distillation (KD). Compared to KD, our method produces better accuracy and more compact models.
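The two mechanisms named in this abstract, an L1 penalty on fully-connected activations followed by pruning of inactive neurons, can be sketched as below. This is a minimal illustration only: the activity measure (mean absolute activation over a batch), the threshold, the penalty weight, and the function names are all assumptions, and the abstract does not say how KDDP decides that a neuron is inactive.

```python
def l1_activation_penalty(activations, weight):
    # L1 regularizer on activations: weight * mean over the batch of sum |a|.
    batch_size = len(activations)
    total = sum(abs(a) for row in activations for a in row)
    return weight * total / batch_size

def active_neurons(activations, threshold):
    # Keep the indices of neurons whose mean |activation| exceeds the
    # threshold (hypothetical criterion for "active").
    batch_size = len(activations)
    num_neurons = len(activations[0])
    means = [sum(abs(row[j]) for row in activations) / batch_size
             for j in range(num_neurons)]
    return [j for j, m in enumerate(means) if m > threshold]

# Hypothetical activations of a 4-neuron layer for a batch of 2 samples.
acts = [
    [0.9, 0.0,   0.3, 0.0],
    [1.1, 0.001, 0.0, 0.0],
]

penalty = l1_activation_penalty(acts, weight=0.01)
keep = active_neurons(acts, threshold=1e-2)
print(penalty, keep)  # neurons 1 and 3 fall below the threshold
```

Because the surviving indices are determined by the data rather than fixed in advance, the final layer width follows from training, which matches the abstract's statement that the student size is determined automatically.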
- Research Article
6
- 10.1016/j.ijar.2024.109301
- Oct 1, 2024
- International Journal of Approximate Reasoning
Uncertainty-based knowledge distillation for Bayesian deep neural network compression
- Conference Article
14
- 10.1109/sips50750.2020.9195219
- Sep 24, 2020
Knowledge distillation (KD) technique that utilizes a pretrained teacher model for training a student network is exploited for the optimization of quantized deep neural networks (QDNNs). We consider the choice of the teacher network and also investigate the effect of hyperparameters for KD. We employ several large floating-point and quantized models as the teacher network. The experiment shows that the softmax distribution produced by the teacher network is more important than its performance for effective KD training. Since the softmax distribution of the teacher network can be controlled by KD's hyperparameters, we analyze the interrelationship of each KD component for quantized DNN training. Our experiments show that even a small teacher model can achieve the same distillation performance as a large teacher model. We also propose the gradual soft loss reduction (GSLR) technique which controls the mixing ratio of hard and soft losses during training for robust KD based QDNN optimization.
- Research Article
12
- 10.1016/j.neucom.2024.127516
- Mar 5, 2024
- Neurocomputing
Multi-perspective analysis on data augmentation in knowledge distillation
- Research Article
- 10.1109/access.2024.3457859
- Jan 1, 2024
- IEEE Access
In recent years, there has been a growing interest in applying knowledge distillation (KD) techniques to the connectionist temporal classification (CTC) framework for training more efficient speech recognition models. Although conventional KD approaches have successfully reduced computational burden, they have limitations in dealing with the inconsistency problem caused by dropout regularization, particularly the gap between the training and inference stages. In the context of KD, this inconsistency may hinder the performance improvement of the student model. To overcome this issue, we propose a novel approach, namely Cons-KD, that combines KD and consistency regularization, where the former trains the student model to benefit from the knowledge of the teacher model, and the latter trains the student model to be more robust to the dropout-induced inconsistency. By directly mitigating the inconsistency problem, our KD framework can further improve the student’s performance compared to the vanilla KD. Experimental results on the LibriSpeech dataset demonstrate that Cons-KD significantly outperforms previous KD methods, improving the word error rate (WER) from 5.10 % to 4.13 % on the test-clean subset and from 12.87 % to 10.32 % on the test-other subset, respectively. These improvements correspond to relative error rate reduction (RERR) of 19.02 % and 19.81 %, respectively, implying notable advancements beyond conventional KD methods. Additionally, we conduct an in-depth analysis to verify the effect of each proposed objective.
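The relative error rate reductions quoted above follow directly from the reported WER values, using the usual definition RERR = (baseline − new) / baseline. The short check below reproduces both figures; the function name is an assumption introduced for illustration.

```python
def relative_error_rate_reduction(baseline_wer, new_wer):
    # Relative reduction in percent: 100 * (baseline - new) / baseline
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

# WER values as reported in the abstract (in percent).
rerr_clean = relative_error_rate_reduction(5.10, 4.13)
rerr_other = relative_error_rate_reduction(12.87, 10.32)

print(f"test-clean: {rerr_clean:.2f} %")
print(f"test-other: {rerr_other:.2f} %")
```

Rounded to two decimals, these come out to 19.02 % and 19.81 %, matching the abstract.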
- Conference Article
6
- 10.1109/wacv51458.2022.00142
- Jan 1, 2022
Knowledge distillation (KD) transfers knowledge of a teacher model to improve performance of a student model which is usually equipped with lower capacity. In the KD framework, however, it is unclear what kind of knowledge is effective and how it is transferred. This paper analyzes a KD process to explore the key factors. In a KD formulation, softmax temperature entangles three main components of student and teacher probabilities and a weight for KD, making it hard to analyze contributions of those factors separately. We disentangle those components so as to further analyze especially the temperature and improve the components respectively. Based on the analysis about temperature and uniformity of the teacher probability, we propose a method, called extractive distillation, for extracting effective knowledge from the teacher model. The extractive KD touches only teacher knowledge, thus being applicable to various KD methods. In the experiments on image classification tasks using Cifar-100 and TinyImageNet datasets, we demonstrate that the proposed method outperforms the other KD methods and analyze feature representation to show its effectiveness in the framework of transfer learning.
- Conference Article
7
- 10.1109/icaice54393.2021.00127
- Nov 1, 2021
Model fusion can effectively improve prediction quality, but it increases time cost. In this paper, dual-stage progressive knowledge distillation is improved by combining it with multi-teacher knowledge distillation. A simple and effective method for integrating the soft targets of multiple teachers is proposed, which strengthens the guiding role of the best-performing models in knowledge distillation. Dual-stage progressive knowledge distillation is a method for small-sample knowledge distillation in which a progressive network grafting method is used to realize knowledge distillation in a small-sample environment. In the first step, the student blocks are grafted one by one onto the teacher network and intertwined with other teacher blocks for training, and the training process only updates the parameters of the grafted blocks. In the second step, the trained student blocks are grafted onto the teacher network in turn, so that the learned student blocks adapt to each other and finally replace the teacher network to obtain a lighter network structure. Using the soft targets obtained by this method in dual-stage progressive knowledge distillation, instead of hard-target training, excellent results were obtained on the BreakHis dataset.
- Research Article
9
- 10.1109/tcds.2022.3232569
- Sep 1, 2023
- IEEE Transactions on Cognitive and Developmental Systems
As a simple yet effective model compression method, knowledge distillation (or KD) is used to learn a small lightweight student network by transferring valuable knowledge from a pretrained cumbersome teacher network. However, existing KD methods usually consider the feature knowledge either in different layers or individual samples, failing to explore more detailed information in different channels from the perspective of sample relationships. Meanwhile, the negative influences contained in the teacher knowledge are also not well investigated, especially, when using the response-based knowledge. To address the above-mentioned issues, we devise a novel KD approach entitled channel correlation-based selective KD (or CCSKD). Specifically, to distill rich knowledge from feature representations, we not only consider the feature knowledge from different channels for individual samples but also take into account the relational knowledge based on per-channel features for different samples. Furthermore, to further distill positive response-based knowledge, a selective strategy is developed, i.e., selective KD, to progressively correct the negative influences from the teacher knowledge during the distillation process. We perform extensive experiments on three image classification data sets, CIFAR-100, Stanford Cars, and Tiny-ImageNet, to demonstrate the effectiveness of the proposed CCSKD, which outperforms recent state-of-the-art methods with a clear margin. Our codes are publicly available at https://github.com/gjplab/CCSKD.
- Book Chapter
3
- 10.1007/978-3-031-26284-5_24
- Jan 1, 2023
Knowledge Distillation (KD) is a compression framework that transfers distilled knowledge from a teacher to a smaller student model. KD approaches conventionally address problem domains where the teacher and student network have equal numbers of classes for classification. We provide a knowledge distillation solution tailored for class specialization, where the user requires a compact and performant network specializing in a subset of classes from the class set used to train the teacher model. To this end, we introduce a novel knowledge distillation framework, Class Specialized Knowledge Distillation (CSKD), that combines two loss functions: Renormalized Knowledge Distillation (RKD) and Intra-Class Variance (ICV) to render a computationally-efficient, specialized student network. We report results on several popular architectural benchmarks and tasks. In particular, CSKD consistently demonstrates significant performance improvements over teacher models for highly restrictive specialization tasks (e.g., instances where the number of subclasses or datasets is relatively small), in addition to outperforming other state-of-the-art knowledge distillation approaches for class specialization tasks.
Keywords: Neural network compression, Class specialization, Knowledge distillation