Forget Me Not: Fighting Local Overfitting With Knowledge Fusion and Distillation.
Overfitting in deep neural networks occurs less frequently than expected. This is a puzzling observation, as theory predicts that greater model capacity should eventually lead to overfitting - yet this is rarely seen in practice. But what if overfitting does occur, not globally, but in specific sub-regions of the data space? In this work, we introduce a novel score that measures the forgetting rate of deep models on validation data, capturing what we term local overfitting: a performance degradation confined to certain regions of the input space. We demonstrate that local overfitting can arise even without conventional overfitting, and is closely linked to the double descent phenomenon. Building on these insights, we introduce a two-stage approach that leverages the training history of a single model to recover and retain forgotten knowledge: first, by aggregating checkpoints into an ensemble, and then by distilling it into a single model of the original size, thus enhancing performance without added inference cost. Extensive experiments across multiple datasets, modern architectures, and training regimes validate the effectiveness of our approach. Notably, in the presence of label noise, our method - Knowledge Fusion followed by Knowledge Distillation - outperforms both the original model and independently trained ensembles, achieving a rare win-win scenario: reduced training and inference complexity.
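The two ideas in this abstract, a forgetting score measured on validation data and an ensemble built from the checkpoints of a single training run, can be illustrated with a minimal sketch. The abstract does not give exact definitions, so both functions are assumptions: `forget_fraction` counts validation examples that some earlier checkpoint classified correctly but the final checkpoint misclassifies, and `fuse_checkpoints` averages class probabilities with uniform weights. The function names and the data layout are hypothetical.

```python
from typing import List, Sequence


def forget_fraction(history: Sequence[Sequence[int]], labels: Sequence[int]) -> float:
    """Fraction of validation examples that at least one earlier checkpoint
    predicted correctly but the final checkpoint predicts incorrectly.

    history[c][i] is the class predicted by checkpoint c for example i
    (assumed layout; the last entry is the final model).
    """
    final = history[-1]
    forgotten = 0
    for i, y in enumerate(labels):
        ever_correct = any(preds[i] == y for preds in history[:-1])
        if ever_correct and final[i] != y:
            forgotten += 1
    return forgotten / len(labels)


def fuse_checkpoints(prob_history: Sequence[Sequence[Sequence[float]]]) -> List[List[float]]:
    """Uniformly average class probabilities over checkpoints.

    prob_history[c][i][k] is the probability checkpoint c assigns to
    class k for example i (assumed layout).
    """
    n_ckpt = len(prob_history)
    n_examples = len(prob_history[0])
    n_classes = len(prob_history[0][0])
    return [
        [sum(prob_history[c][i][k] for c in range(n_ckpt)) / n_ckpt
         for k in range(n_classes)]
        for i in range(n_examples)
    ]
```

Under these assumptions, the fused probabilities would serve as soft targets when distilling the ensemble back into a single model of the original size.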
- Research Article
- 10.1609/aaai.v39i19.34269
- Apr 11, 2025
- Proceedings of the AAAI Conference on Artificial Intelligence
The infrequent occurrence of overfitting in deep neural networks is perplexing: contrary to theoretical expectations, increasing model size often enhances performance in practice. But what if overfitting does occur, though restricted to specific sub-regions of the data space? In this work, we propose a novel score that captures the forgetting rate of deep models on validation data. We posit that this score quantifies local overfitting: a decline in performance confined to certain regions of the data space. We then show empirically that local overfitting occurs regardless of the presence of traditional overfitting. Using the framework of deep over-parametrized linear models, we offer a certain theoretical characterization of forgotten knowledge, and show that it correlates with knowledge forgotten by real deep models. Finally, we devise a new ensemble method that aims to recover forgotten knowledge, relying solely on the training history of a single network. When combined with knowledge distillation, this method will enhance the performance of a trained model without adding inference costs. Extensive empirical evaluations demonstrate the efficacy of our method across multiple datasets, contemporary neural network architectures, and training protocols.
- Research Article
1
- 10.59247/jahir.v2i2.289
- Aug 31, 2024
- Journal of Advanced Health Informatics Research
This research aims to apply the knowledge distillation method to medical image classification, specifically in the case of lung and colon image classification using various transfer learning models. Knowledge distillation allows the transfer of knowledge from a larger model (teacher) to a smaller model (student), which enables more efficient model building without sacrificing accuracy. In this research, the DenseNet169 model is used as the teacher model. The student model uses several alternative transfer learning architectures such as DenseNet121, MobileNet, ResNet50, InceptionV3, and Xception. The data used consists of 25,000 histopathology images that have been processed and divided into training, validation, and test data. Data augmentation was performed to enlarge the dataset from 750 to 25,000 images, which helped improve the performance of the model. Model performance evaluation was performed by measuring the accuracy and loss value of each student model compared to the teacher model. The results showed that the student models generated through the knowledge distillation process performed close to or even exceeded the teacher model in some cases, with the Xception model showing the highest accuracy of 96.95%. In conclusion, knowledge distillation is effective in reducing model complexity without compromising performance, which is particularly beneficial for implementation on resource-constrained devices.
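The teacher-to-student transfer described here is usually trained with a combined loss. The sketch below follows the widely used temperature-scaled formulation of Hinton et al.; the abstract does not say which loss, temperature, or weighting the authors used, so `temperature`, `alpha`, and the function names are assumptions chosen for illustration.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    student_logits: Sequence[float],
    teacher_logits: Sequence[float],
    label: int,
    temperature: float = 4.0,  # assumed value
    alpha: float = 0.5,        # assumed weighting between the two terms
) -> float:
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * cross-entropy."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL divergence between the softened teacher and student distributions.
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student) if pt > 0)
    # Standard cross-entropy of the student against the hard label.
    ce = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce
```

When the student already matches the teacher, the KL term vanishes and only the weighted cross-entropy remains.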
- Conference Article
72
- 10.1109/iccv48922.2021.00501
- Oct 1, 2021
Knowledge distillation (KD) transfers the dark knowledge from cumbersome networks (teacher) to lightweight (student) networks and expects the student to achieve more promising performance than training without the teacher’s knowledge. However, a counter-intuitive argument is that better teachers do not make better students due to the capacity mismatch. To this end, we present a novel adaptive knowledge distillation method to complement traditional approaches. The proposed method, named Student Customized Knowledge Distillation (SCKD), examines the capacity mismatch between teacher and student from the perspective of gradient similarity. We formulate knowledge distillation as a multi-task learning problem so that the teacher transfers knowledge to the student only if the student can benefit from learning such knowledge. We validate our methods on multiple datasets with various teacher-student configurations on image classification, object detection, and semantic segmentation.
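One way to picture "transfer only if the student can benefit" is to compare the gradient of the task loss with the gradient of the distillation loss and apply the latter only when the two point in a similar direction. The sketch below illustrates that idea with a cosine-similarity gate. It is not the SCKD rule itself, which the abstract does not specify; `gated_gradient` and `threshold` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two gradient vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def gated_gradient(
    task_grad: Sequence[float],
    distill_grad: Sequence[float],
    threshold: float = 0.0,  # assumed gate: keep only positively aligned gradients
) -> List[float]:
    """Add the distillation gradient only when it agrees with the task gradient."""
    if cosine_similarity(task_grad, distill_grad) > threshold:
        return [g + d for g, d in zip(task_grad, distill_grad)]
    # Conflicting teacher signal: fall back to the task gradient alone.
    return list(task_grad)
```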
- Research Article
1
- 10.3390/app122110952
- Oct 28, 2022
- Applied Sciences
Structural network pruning is an effective way to reduce network size for deploying deep networks to resource-constrained devices. Existing methods mainly employ knowledge distillation from the last layer of the network to guide pruning of the whole network, and informative features from intermediate layers are not yet fully exploited to improve pruning efficiency and accuracy. In this paper, we propose a block-wisely supervised network pruning (BNP) approach to find the optimal subnet from a baseline network based on knowledge distillation and Markov Chain Monte Carlo. To achieve this, the baseline network is divided into small blocks, and block shrinkage can be independently applied to each block in the same manner. Specifically, block-wise representations of the baseline network are exploited to supervise subnet search by encouraging each block of the student network to imitate the behavior of the corresponding baseline block. A score metric measuring block accuracy and efficiency is assigned to each block, and block search is conducted under a Markov Chain Monte Carlo scheme to sample blocks from the posterior. Knowledge distillation enables effective feature representations of the student network, and Markov Chain Monte Carlo provides a sampling scheme to find the optimal solution. Extensive evaluations on multiple network architectures and datasets show BNP outperforms the state of the art. For instance, with 0.16% accuracy improvement on the CIFAR-10 dataset, it yields a more compact subnet of ResNet-110 than other methods by reducing 61.24% FLOPs.
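The block search can be pictured as a Metropolis sampler over per-block configurations, where higher-scoring configurations are visited more often. The sketch below assumes a target density proportional to `exp(score / temperature)` and a symmetric proposal that resamples one block at a time. The score function, the candidate widths, and all names are hypothetical stand-ins, not the BNP metric or search space.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def metropolis_block_search(
    choices: Sequence[Sequence[int]],
    score: Callable[[List[int]], float],
    steps: int = 200,
    temperature: float = 0.1,
    seed: int = 0,
) -> Tuple[List[int], float]:
    """Sample block configurations and return the best one visited."""
    rng = random.Random(seed)
    current = [options[0] for options in choices]
    current_score = score(current)
    best, best_score = list(current), current_score
    for _ in range(steps):
        # Symmetric proposal: pick one block and resample its setting uniformly.
        proposal = list(current)
        block = rng.randrange(len(choices))
        proposal[block] = rng.choice(list(choices[block]))
        proposal_score = score(proposal)
        delta = proposal_score - current_score
        # Metropolis rule: always accept improvements, sometimes accept worse moves.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = proposal, proposal_score
            if current_score > best_score:
                best, best_score = list(current), current_score
    return best, best_score


def toy_score(widths: List[int]) -> float:
    """Hypothetical score that prefers a width of 32 in every block."""
    return -sum((w - 32) ** 2 for w in widths) / 1000.0
```

A call such as `metropolis_block_search([[16, 32, 64]] * 3, toy_score)` returns the best configuration visited together with its score.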
- Conference Article
- 10.1109/iccea58433.2023.10135504
- Apr 7, 2023
In multi-view learning, how to effectively integrate features among multiple views is a key challenge. For multi-label tasks, the relationship between features and labels is extremely important. This paper proposes a Target Embedding Autoencoder framework based on knowledge distillation (tea-mvml), which explores the correlation between features and labels in Multi-View Multi-Label (MVML) learning. The framework transfers knowledge from a teacher model based on a Target Embedding Autoencoder (TEA) to a small student model through knowledge distillation. The teacher model of tea-mvml learns the relationship between features and labels in the latent space, while the student model inherits the generalization ability of the teacher model. Experimental results on multiple real-world datasets show that tea-mvml not only reduces the complexity of the model, but also outperforms other state-of-the-art multi-view multi-label classification approaches.
- Research Article
- 10.1049/cit2.70036
- Jul 4, 2025
- CAAI Transactions on Intelligence Technology
With the increasing constraints of hardware devices, there is a growing demand for compact models to be deployed on device endpoints. Knowledge distillation, a widely used technique for model compression and knowledge transfer, has gained significant attention in recent years. However, traditional distillation approaches compare the knowledge of individual samples indirectly through class prototypes, overlooking the structural relationships between samples. Although recent distillation methods based on contrastive learning can capture relational knowledge, their relational constraints often distort the positional information of the samples, leading to compromised performance in the distilled model. To address these challenges and further enhance the performance of compact models, we propose a novel approach, termed contrastive learning‐based multi‐level knowledge distillation (CLMKD). The CLMKD framework introduces three key modules: class‐guided contrastive distillation, gradient relation contrastive distillation, and semantic similarity distillation. These modules are effectively integrated into a unified framework to extract feature knowledge from multiple levels, capturing not only the representational consistency of individual samples but also their higher‐order structure and semantic similarity. We evaluate the proposed CLMKD method on multiple image classification datasets and the results demonstrate its superior performance compared to state‐of‐the‐art knowledge distillation methods.
- Research Article
2
- 10.1016/j.jnlest.2024.100278
- Aug 16, 2024
- Journal of Electronic Science and Technology
De-biased knowledge distillation framework based on knowledge infusion and label de-biasing techniques
- Conference Article
2
- 10.18653/v1/2021.wnut-1.33
- Jan 1, 2021
Knowledge Distillation (KD) is extensively used to compress and deploy large pre-trained language models on edge devices for real-world applications. However, one neglected area of research is the impact of noisy (corrupted) labels on KD. We present, to the best of our knowledge, the first study on KD with noisy labels in Natural Language Understanding (NLU). We document the scope of the problem and present two methods to mitigate the impact of label noise. Experiments on the GLUE benchmark show that our methods are effective even under high noise levels. Nevertheless, our results indicate that more research is necessary to cope with label noise under KD.
- Research Article
- 10.1145/3808223
- Apr 21, 2026
- ACM Transactions on Information Systems
Large language models (LLMs) have been extensively applied in various recommendation scenarios, including bundle generation, thanks to their exceptional reasoning capabilities and comprehensive knowledge. However, exploiting large-scale LLMs for bundle generation introduces significant efficiency challenges—primarily high computational costs during fine-tuning and inference due to their massive parameterization. Knowledge distillation (KD) offers a promising solution by transferring expertise from large teacher models to more compact student models. This study systematically investigates KD approaches for bundle generation with the goal of minimizing computational demands while preserving performance. Specifically, we explore three critical research questions: (1) how does the format of distilled knowledge impact bundle generation performance? (2) to what extent does the quantity of distilled knowledge influence the performance? and (3) how do different ways of utilizing the distilled knowledge affect the performance? To support this investigation, we propose a comprehensive KD framework that (i) progressively extracts knowledge from raw data in increasingly complex forms, i.e., frequent patterns \(\rightarrow\) formalized rules \(\rightarrow\) deep thoughts; (ii) captures varying quantities of distilled knowledge through different sampling strategies, multi-domain accumulation, and multi-format aggregation; and (iii) exploits complementary LLM adaptation techniques—in-context learning, supervised fine-tuning and their combination—to leverage the distilled knowledge for domain-specific adaptation and enhanced efficiency in small student models. 
Through extensive experiments on multiple real-world datasets, we provide valuable insights into how knowledge format, quantity, and utilization methods collectively shape the performance of LLM-based bundle generation, which exhibits the significant potential of KD for more efficient yet effective LLM-based bundle generation.
- Research Article
12
- 10.3390/e26010096
- Jan 22, 2024
- Entropy
Federated learning allows multiple parties to train models while jointly protecting user privacy. However, traditional federated learning requires each client to have the same model structure to fuse the global model. In real-world scenarios, each client may need to develop personalized models based on its environment, making it difficult to perform federated learning in a heterogeneous model environment. Some knowledge distillation methods address the problem of heterogeneous model fusion to some extent. However, these methods assume that each client is trustworthy. Some clients may produce malicious or low-quality knowledge, making it difficult to aggregate trustworthy knowledge in a heterogeneous environment. To address these challenges, we propose a trustworthy heterogeneous federated learning framework (FedTKD) to achieve client identification and trustworthy knowledge fusion. Firstly, we propose a malicious client identification method based on client logit features, which can exclude malicious information in fusing global logit. Then, we propose a selectivity knowledge fusion method to achieve high-quality global logit computation. Additionally, we propose an adaptive knowledge distillation method to improve the accuracy of knowledge transfer from the server side to the client side. Finally, we design different attack and data distribution scenarios to validate our method. The experiment shows that our method outperforms the baseline methods, showing stable performance in all attack scenarios and achieving an accuracy improvement of 2% to 3% in different data distributions.
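The idea of excluding malicious client logits before fusing a global logit can be sketched with a simple robust-aggregation rule. The sketch assumes each client submits one logit vector, uses the coordinate-wise median as a reference, and drops clients whose distance from it exceeds `tolerance` times the median distance. This filter is an illustrative assumption and not the FedTKD identification method; all names are hypothetical.

```python
import math
import statistics
from typing import List, Sequence, Tuple


def fuse_client_logits(
    client_logits: Sequence[Sequence[float]],
    tolerance: float = 2.0,  # assumed cutoff factor; should be >= 1
) -> Tuple[List[float], List[int]]:
    """Average the logits of clients that lie close to the coordinate-wise median.

    Returns the fused logit vector and the indices of the clients that were kept.
    """
    n_classes = len(client_logits[0])
    # Robust reference point: the median of each logit coordinate across clients.
    reference = [statistics.median(c[k] for c in client_logits) for k in range(n_classes)]
    distances = [math.dist(c, reference) for c in client_logits]
    cutoff = tolerance * statistics.median(distances)
    kept = [i for i, d in enumerate(distances) if d <= cutoff]
    fused = [sum(client_logits[i][k] for i in kept) / len(kept) for k in range(n_classes)]
    return fused, kept
```

With `tolerance` of at least 1, the client closest to the reference always passes the cutoff, so the average is never taken over an empty set.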
- Research Article
1
- 10.1109/tnnls.2025.3640274
- Jan 1, 2025
- IEEE transactions on neural networks and learning systems
Wearable sensors have found numerous applications in health and wellness promotion and have achieved great success leveraging advancements in deep learning. However, the development of robust models continues to be hindered by issues related to sensor noise, inconsistent sampling rates, and individual differences. Topological data analysis (TDA) has emerged as a viable solution to extract robust features from such time-series data by converting them into persistence images (PIs), which capture intrinsic characteristics and demonstrate resilience to noise and signal variations. However, the computational costs of TDA pose significant challenges for small devices with limited resources. To more efficiently incorporate topological features, we utilize knowledge distillation (KD), which is a promising way to generate a smaller model using larger models. Multiple teachers can be adopted to enrich features in KD. However, this approach has presented two key challenges: 1) differences in feature dimensions from multimodal data and 2) conflicting knowledge provided by the different teachers, both of which can degrade the student model's performance. To address these issues, we propose a novel KD framework called multimodal global latent workspace-based KD (mGLW-KD) that is motivated by global workspace theory (GWT) from cognitive neuroscience. GWT models how the brain integrates and distributes relevant information across different neural modules through a shared workspace, and it includes attentional control and working memory to prioritize and retain key information. Inspired by this theory, mGLW-KD incorporates a working memory module to unify diverse knowledge from multiple teacher models into a shared latent workspace, facilitating efficient knowledge transfer to the student model. 
By integrating topological insights with cognitive principles, mGLW-KD addresses the challenges posed by wearable sensor data and enables the student model to achieve superior performance using only time-series input during inference.
- Research Article
2
- 10.3390/electronics14091784
- Apr 27, 2025
- Electronics
With the increasing severity of data privacy and security issues, cross-organizational federated learning is facing challenges in communication efficiency and cost. Knowledge distillation, as an effective model compression technique, can reduce model size without significantly compromising accuracy, thereby lowering communication overhead. However, existing knowledge distillation methods either employ static distillation loss weights, ignoring bandwidth variations in communication networks, or fail to effectively account for bandwidth heterogeneity among different nodes, leading to communication bottlenecks. To enhance the overall system efficiency, there is an urgent need to find new methods that enable models to achieve optimal performance in resource-constrained environments. This paper proposes a communication optimization method based on mutual knowledge distillation (Fed-MKD) to address the bottleneck issues caused by high communication costs in cross-organizational federated learning. By leveraging a mutual distillation mechanism, Fed-MKD enables collaborative training of teacher and student models locally while reducing the frequency and size of global model transmissions to optimize communication. Our experimental results demonstrate that, compared to classical knowledge distillation methods, Fed-MKD significantly improves communication efficiency, with compression ratios ranging from 4.89× to 28.45×. Furthermore, Fed-MKD achieves up to 4.34× acceleration in convergence time across multiple datasets. These findings highlight the significant practical value of Fed-MKD in environments with heterogeneous data distributions and limited communication resources.
- Research Article
- 10.13031/aea.16203
- Jan 1, 2025
- Applied Engineering in Agriculture
Highlights: Lightweight network for accurate aphid counting. Two complementary feature measurements for efficient knowledge transfer. Achieving comparable counting performance using about one-fifth of the parameters of the baseline. Strong applicability to other tasks, such as mealworm counting.
Abstract. In the fields of agriculture, forestry, and horticulture, aphids are a pernicious pest with the most deleterious effect on crops, the widest geographical range, and the most rapid reproductive rate. Precisely handling these pests is therefore an urgent need in planting automation, and a vital prerequisite is to count these aphids. The aphid counting model based on computer vision accomplishes precise aphid counting by learning a mapping between images and real labels. Nevertheless, in the practical application deployment, prevailing high-performance network models usually consist of a large number of parameters and have high hardware requirements, making it problematic to apply to edge devices. To tackle these challenges, this work studied a robust feature transfer strategy for efficient distillation of high-performance aphid-counting networks. Specifically, two complementary loss functions were explored to extract effective knowledge from the teacher network and enhance the learning capability of the student network. Experimental results showed that, compared with recent methods, our method achieved significantly better results across multiple datasets with minimal computational overhead. Meanwhile, it also achieved state-of-the-art performance on the mealworm dataset, demonstrating the effectiveness and applicability of our method.
Keywords: Aphid counting, Convolutional networks, Knowledge distillation, Light-weight architecture.
- Research Article
17
- 10.1109/tcc.2022.3160129
- Apr 1, 2023
- IEEE Transactions on Cloud Computing
In recent years, deep neural networks have shown extraordinary power in various practical learning tasks, especially in object detection, classification, and natural language processing. However, deploying such large models on resource-constrained devices or embedded systems is challenging due to their high computational cost. Efforts such as model partition, pruning, or quantization have been used at the expense of accuracy loss. Knowledge distillation is a technique that transfers model knowledge from a well-trained model (teacher) to a smaller and shallower model (student). Instead of using a learning model on the cloud, we can deploy distilled models on various edge devices, significantly reducing the computational cost and memory usage and prolonging the battery lifetime. In this work, we propose a novel neuron manifold distillation (NMD) method, where the student models imitate the teacher's output distribution and learn the feature geometry of the teacher model. In addition, to further improve the cloud-based learning system reliability, we propose a confident prediction mechanism to calibrate the model predictions. We conduct experiments with different distillation configurations over multiple datasets. Our proposed method demonstrates a consistent improvement in accuracy-speed trade-offs for the distilled model.
- Research Article
5
- 10.1109/tnnls.2023.3335829
- Jan 1, 2025
- IEEE transactions on neural networks and learning systems
Knowledge distillation (KD), which aims at transferring the knowledge from a complex network (a teacher) to a simpler and smaller network (a student), has received considerable attention in recent years. Typically, most existing KD methods work on well-labeled data. Unfortunately, real-world data often inevitably involve noisy labels, thus leading to performance deterioration of these methods. In this article, we study a little-explored but important issue, i.e., KD with noisy labels. To this end, we propose a novel KD method, called ambiguity-guided mutual label refinery KD (AML-KD), to train the student model in the presence of noisy labels. Specifically, based on the pretrained teacher model, a two-stage label refinery framework is innovatively introduced to refine labels gradually. In the first stage, we perform label propagation (LP) with small-loss selection guided by the teacher model, improving the learning capability of the student model. In the second stage, we perform mutual LP between the teacher and student models in a mutual-benefit way. During the label refinery, an ambiguity-aware weight estimation (AWE) module is developed to address the problem of ambiguous samples, avoiding overfitting these samples. One distinct advantage of AML-KD is that it is capable of learning a high-accuracy and low-cost student model with label noise. The experimental results on synthetic and real-world noisy datasets show the effectiveness of our AML-KD against state-of-the-art KD methods and label noise learning (LNL) methods. Code is available at https://github.com/Runqing-forMost/AML-KD.
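Small-loss selection, mentioned in the first stage, rests on the heuristic that examples on which a trained model incurs low loss are more likely to carry clean labels. The sketch below shows the generic form of that heuristic using the teacher's predicted probabilities. It is not the AML-KD procedure, whose details the abstract does not give; `keep_ratio` and the function name are assumptions.

```python
import math
from typing import List, Sequence


def small_loss_selection(
    teacher_probs: Sequence[Sequence[float]],
    noisy_labels: Sequence[int],
    keep_ratio: float = 0.5,  # assumed fraction of examples treated as clean
) -> List[int]:
    """Return the sorted indices of the examples with the smallest teacher loss."""
    # Per-example cross-entropy of the teacher against the given (possibly noisy) label.
    losses = [-math.log(max(p[y], 1e-12)) for p, y in zip(teacher_probs, noisy_labels)]
    n_keep = max(1, round(keep_ratio * len(losses)))
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    return sorted(order[:n_keep])
```

The selected indices could then seed a label-propagation step, while the remaining examples are treated as candidates for relabeling.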