Knowledge Distillation in Image Classification: The Impact of Datasets
As the demand for efficient and lightweight models in image classification grows, knowledge distillation has emerged as a promising technique to transfer expertise from complex teacher models to simpler student models. However, the efficacy of knowledge distillation is intricately linked to the choice of datasets used during training. Datasets are pivotal in shaping a model’s learning process, influencing its ability to generalize and discriminate between diverse patterns. While considerable research has independently explored knowledge distillation and image classification, a comprehensive understanding of how different datasets impact knowledge distillation remains a critical gap. This study systematically investigates the impact of diverse datasets on knowledge distillation in image classification. By varying dataset characteristics such as size, domain specificity, and inherent biases, we aim to unravel the nuanced relationship between datasets and the efficacy of knowledge transfer. Our experiments employ a range of datasets to comprehensively explore their impact on the performance gains achieved through knowledge distillation. This study contributes valuable guidance for researchers and practitioners seeking to optimize image classification models through knowledge distillation. By elucidating the intricate interplay between dataset characteristics and knowledge distillation outcomes, our findings empower the community to make informed decisions when selecting datasets, ultimately advancing the field toward more robust and efficient model development.
- Research Article
39
- 10.34133/plantphenomics.0062
- Jan 1, 2023
- Plant Phenomics
Timely plant disease diagnosis can inhibit the spread of the disease and prevent a large-scale drop in production, which benefits food production. Object detection-based plant disease diagnosis methods have attracted widespread attention due to their accuracy in classifying and locating diseases. However, existing methods are still limited to single-crop disease diagnosis. More importantly, existing models have a large number of parameters, which makes them hard to deploy on agricultural mobile devices. At the same time, reducing the number of model parameters tends to cause a decrease in model accuracy. To solve these problems, we propose a plant disease detection method based on knowledge distillation to achieve a lightweight and efficient diagnosis of multiple diseases across multiple crops. In detail, we design 2 strategies to build 4 different lightweight models as student models: the YOLOR-Light-v1, YOLOR-Light-v2, Mobile-YOLOR-v1, and Mobile-YOLOR-v2 models, and adopt the YOLOR model as the teacher model. We develop a multistage knowledge distillation method to improve lightweight model performance, achieving 60.4% mAP@.5 on the PlantDoc dataset with few model parameters, outperforming existing methods. Overall, the multistage knowledge distillation technique can make the model lighter while maintaining high accuracy. In addition, the technique can be extended to other tasks, such as image classification and image segmentation, to obtain automated plant disease diagnostic models with a wider range of lightweight applicability in smart agriculture. Our code is available at https://github.com/QDH/MSKD.
- Research Article
8
- 10.1109/tcss.2023.3293882
- Apr 1, 2024
- IEEE Transactions on Computational Social Systems
Recent multiobject tracking (MOT) methods usually use very deep neural networks to achieve competitive accuracy, which inevitably results in degraded inference speed. To strike a better balance between tracking accuracy and speed, in this work, we propose to compress the MOT model via knowledge distillation (KD), enabling the more lightweight student model to obtain performance similar to that of the teacher model. Although KD has been well studied for simpler tasks such as image classification, the complexity of MOT poses new challenges because the MOT model is more sensitive to foreground information than the classification model. To deal with that, we first propose attention-guided feature distillation, which focuses the student model on the crucial region (foreground and the region with strong discrepancy against itself) of the teacher’s feature map. Moreover, we propose a foreground mask, which leverages the knowledge from the teacher model to filter out the low-quality soft labels from the background, thereby reducing their negative effects on distillation. Evaluations on several benchmarks demonstrate that the proposed KD method can make the student network achieve leading performance while running 20.0%–27.4% faster than the teacher network and reducing the parameters by 28.5%–87.1%. To the best of our knowledge, this is the first work to compress the MOT model via KD.
- Research Article
11
- 10.70917/2025014
- Jan 6, 2025
- International Journal of Computer Information Systems and Industrial Management Applications
As edge computing gains attention across various domains, the demand for lightweight deep learning models capable of running efficiently on resource-constrained edge devices has surged. This survey investigates the landscape of lightweight deep learning models tailored for edge computing environments. The survey explores various model compression techniques used to design and optimize deep learning models for edge deployment, including model quantization, pruning, and knowledge distillation. Emphasis is placed on strategies to reduce model size, computational complexity, and memory footprint while maintaining satisfactory performance levels. Additionally, the study examines the performance of these techniques on three real-life datasets evaluating lightweight deep learning models, highlighting the importance of balanced datasets representative of edge device deployment scenarios. Furthermore, this survey provides a comprehensive overview of the current state of lightweight deep learning models for edge devices, offering insights into design considerations, optimization techniques, and performance evaluation methodologies. The findings show that most of the compression techniques suffer from performance degradation, proving the existence of a trade-off between compression and performance. Therefore, we proposed a hybrid lossless compressed model by combining pruning, quantization, and knowledge distillation to reduce parameters and weights, resulting in a lightweight model. The proposed model is three times smaller than the vanilla CNN model and achieved a state-of-the-art accuracy of 97% after compression, which shows the effectiveness of our approach. These results will serve as a valuable resource for researchers and practitioners aiming to develop efficient and scalable deep learning solutions for edge computing applications.
- Research Article
5
- 10.70917/ijcisim-2025-0014
- Jan 6, 2025
- International Journal of Computer Information Systems and Industrial Management Applications
As edge computing gains attention across various domains, the demand for lightweight deep learning models capable of running efficiently on resource-constrained edge devices has surged. This survey investigates the landscape of lightweight deep learning models tailored for edge computing environments. The survey explores various model compression techniques used to design and optimize deep learning models for edge deployment, including model quantization, pruning, and knowledge distillation. Emphasis is placed on strategies to reduce model size, computational complexity, and memory footprint while maintaining satisfactory performance levels. Additionally, the study examines the performance of these techniques on three real-life datasets evaluating lightweight deep learning models, highlighting the importance of balanced datasets representative of edge device deployment scenarios. Furthermore, this survey provides a comprehensive overview of the current state of lightweight deep learning models for edge devices, offering insights into design considerations, optimization techniques, and performance evaluation methodologies. The findings show that most of the compression techniques suffer from performance degradation, proving the existence of a trade-off between compression and performance. Therefore, we proposed a hybrid lossless compressed model by combining pruning, quantization, and knowledge distillation to reduce parameters and weights, resulting in a lightweight model. The proposed model is three times smaller than the vanilla CNN model and achieved a state-of-the-art accuracy of 97% after compression, which shows the effectiveness of our approach. These results will serve as a valuable resource for researchers and practitioners aiming to develop efficient and scalable deep learning solutions for edge computing applications.
- Research Article
3
- 10.1109/embc40787.2023.10340704
- Jul 24, 2023
- Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual International Conference
Cardiovascular diseases (CVDs) are the number one cause of death worldwide. In recent years, intelligent auxiliary diagnosis of CVDs based on computer audition has become a popular research field, and intelligent diagnosis technology is increasingly mature. Neural networks used to monitor CVDs are becoming more complex, requiring more computing power and memory, and are difficult to deploy in wearable devices. This paper proposes a lightweight model for classifying heart sounds based on knowledge distillation, which can be deployed in wearable devices to monitor the heart sounds of wearers. The network model is designed based on Convolutional Neural Networks (CNNs). Model performance is evaluated by extracting Mel Frequency Cepstral Coefficients (MFCCs) features from the PhysioNet/CinC Challenge 2016 dataset. The experimental results show that knowledge distillation can improve a lightweight network's accuracy, and our model performs well on the test set. In particular, when the knowledge distillation temperature is 7 and the weight α is 0.1, the accuracy is 88.5%, the recall is 83.8%, and the specificity is 93.6%. Clinical relevance: A lightweight model of heart sound classification based on knowledge distillation can be deployed on various hardware devices for timely monitoring and feedback of the physical condition of patients with CVDs, so that medical advice can be provided promptly. When the model is deployed on the medical instruments of the hospital, the condition of severe and hospitalised patients can be fed back promptly and clinical treatment advice can be provided to the clinicians.
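The temperature and the weight α reported above are the two hyperparameters of a standard soft-label distillation loss. A minimal sketch in plain Python follows, assuming the common formulation in which α weights the hard-label cross-entropy and (1 − α) weights the temperature-scaled KL term. The abstract does not say which term α multiplies, so that assignment, the T² scaling, and the function names `softmax` and `kd_loss` are assumptions for illustration, not the paper's implementation.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, label, temperature=7.0, alpha=0.1):
    """alpha * CE(hard label) + (1 - alpha) * T^2 * KL(teacher || student).

    Assumption: alpha weights the hard-label term; some papers use the
    opposite convention.
    """
    # Hard-label cross-entropy, computed at temperature 1.
    p_student = softmax(student_logits)
    ce = -math.log(p_student[label])

    # Soft-label KL divergence, computed at the distillation temperature.
    q_teacher = softmax(teacher_logits, temperature)
    q_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(q_teacher, q_student) if t > 0)

    return alpha * ce + (1.0 - alpha) * temperature ** 2 * kl


# Usage: a student that disagrees with the teacher pays a larger loss.
agree = kd_loss([2.0, 0.0], [2.0, 0.0], label=0)
disagree = kd_loss([0.0, 2.0], [2.0, 0.0], label=0)
```

A higher temperature flattens both distributions, so the student also learns from the relative probabilities the teacher assigns to the wrong classes.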
- Research Article
23
- 10.1016/j.heliyon.2024.e34376
- Jul 1, 2024
- Heliyon
Efficient image classification through collaborative knowledge distillation: A novel AlexNet modification approach
- Research Article
19
- 10.3390/sym17071002
- Jun 25, 2025
- Symmetry
Knowledge distillation (KD) is crucial for remote sensing image (RSI) classification, particularly as the operating environment in remote sensing is often constrained by hardware limitations. However, prior research has not fully addressed the challenge of leveraging KD to develop lightweight, high-accuracy models for RSI classification. A key issue is the sparse distribution of training data, which often results in asymmetry within the data. This asymmetry impedes the transfer of prior knowledge during the distillation process, diminishing the overall efficacy of KD techniques. To overcome this challenge, we propose a novel, symmetry-enhanced approach that augments the logit-based KD process, improving its effectiveness and efficiency for RSI classification. Our method is distinguished by three core innovations: a symmetrically generative algorithm to enhance the symmetry of the training data, an efficient algorithm for constructing a robust teacher ensemble model, and a quantitative technique for feature alignment. Rigorous evaluations on three benchmark datasets demonstrate that our method outperforms 14 existing KD-based approaches and 30 other state-of-the-art methods. Specifically, the student model trained with our approach achieves accuracy improvements of up to 22.5% while reducing the model size and inference time by as much as 96% and 88%, respectively. In conclusion, this research makes a significant contribution to RSI classification by introducing an efficient and effective data symmetry-driven method to enhance the knowledge transferring efficiency of the logit-based KD process.
- Conference Article
7
- 10.1109/icaice54393.2021.00127
- Nov 1, 2021
Model fusion can effectively improve model prediction, but it increases inference time. In this paper, dual-stage progressive knowledge distillation is improved in combination with multi-teacher knowledge distillation technology. A simple and effective multi-teacher Softtarget integration method is proposed for multi-teacher network knowledge distillation, which improves the guiding role of excellent models in knowledge distillation. Dual-stage progressive knowledge distillation is a method for small-sample knowledge distillation. A progressive network grafting method is used to realize knowledge distillation in a small-sample environment. In the first step, the student blocks are grafted one by one onto the teacher network and intertwined with other teacher blocks for training, and the training process only updates the parameters of the grafted blocks. In the second step, the trained student blocks are grafted onto the teacher network in turn, so that the learned student blocks adapt to each other and finally replace the teacher network to obtain a lighter network structure. Using the Softtarget acquired by this method in dual-stage progressive knowledge distillation instead of Hardtarget training, excellent results were obtained on the BreakHis dataset.
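The abstract above does not give the formula for its multi-teacher soft-target integration. As a hedged illustration of the general idea, the sketch below averages the teachers' temperature-softened predictions, weighting each teacher by a quality score so that stronger teachers contribute more. The score-proportional weighting, the default temperature of 4.0, and the names `integrate_soft_targets` and `teacher_scores` are all assumptions, not the paper's method.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def integrate_soft_targets(teacher_logits, teacher_scores, temperature=4.0):
    """Weighted average of several teachers' softened class probabilities.

    teacher_logits: one list of logits per teacher, all for the same sample.
    teacher_scores: hypothetical non-negative quality scores (for example,
        validation accuracy); they are normalised to sum to 1.
    """
    total = sum(teacher_scores)
    weights = [s / total for s in teacher_scores]
    probs = [softmax(logits, temperature) for logits in teacher_logits]
    n_classes = len(probs[0])
    return [
        sum(w * p[k] for w, p in zip(weights, probs))
        for k in range(n_classes)
    ]


# Usage: two teachers, the first trusted three times as much as the second.
target = integrate_soft_targets([[3.0, 1.0], [1.0, 3.0]], [0.75, 0.25])
```

Because each teacher's output is a probability distribution and the weights sum to 1, the combined target is also a valid distribution and can be used wherever a single teacher's soft target would be.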
- Research Article
1
- 10.59247/jahir.v2i2.289
- Aug 31, 2024
- Journal of Advanced Health Informatics Research
This research aims to apply the knowledge distillation method to medical image classification, specifically in the case of lung and colon image classification using various transfer learning models. Knowledge distillation allows the transfer of knowledge from a larger model (teacher) to a smaller model (student), which enables more efficient model building without sacrificing accuracy. In this research, the DenseNet169 model is used as the teacher model. The student model uses several alternative transfer learning architectures such as DenseNet121, MobileNet, ResNet50, InceptionV3, and Xception. The data used consists of 25,000 histopathology images that have been processed and divided into training, validation, and test data. Data augmentation was performed to enlarge the dataset from 750 to 25,000 images, which helped improve the performance of the model. Model performance evaluation was performed by measuring the accuracy and loss value of each student model compared to the teacher model. The results showed that the student models generated through the knowledge distillation process performed close to or even exceeded the teacher model in some cases, with the Xception model showing the highest accuracy of 96.95%. In conclusion, knowledge distillation is effective in reducing model complexity without compromising performance, which is particularly beneficial for implementation on resource-constrained devices.
- Conference Article
8
- 10.1109/iscslp57327.2022.10038276
- Dec 11, 2022
Very deep models for speaker recognition (SR) have demonstrated remarkable performance improvement in recent research. However, it is impractical to deploy these models for on-device applications with constrained computational resources. On the other hand, light-weight models are highly desired in practice despite their sub-optimal performance. This research aims to improve light-weight SR models through large-scale label-free knowledge distillation (KD). Existing KD approaches for SR typically require speaker labels to learn task-specific knowledge, due to the inefficiency of conventional loss for distillation. To address the inefficiency problem and achieve label-free KD, we propose to employ the contrastive loss from self-supervised learning for distillation. Extensive experiments are conducted on a collection of public speech datasets from diverse sources. Results on light-weight SR models show that the proposed approach of label-free KD with contrastive loss consistently outperforms both conventional distillation methods and self-supervised learning methods by a significant margin.
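To make the idea of label-free distillation with a contrastive loss concrete, here is a minimal sketch of an InfoNCE-style objective in plain Python: each student embedding is pulled toward the teacher embedding of the same utterance and pushed away from the teacher embeddings of the other utterances in the batch. The abstract does not specify the exact loss, so the InfoNCE form, the temperature of 0.1, and the names `cosine` and `contrastive_distill_loss` are assumptions for illustration.

```python
import math


def cosine(a, b, eps=1e-12):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / max(norm_a * norm_b, eps)


def contrastive_distill_loss(student_embs, teacher_embs, temperature=0.1):
    """InfoNCE-style loss; no speaker labels are needed.

    The positive pair for student embedding i is teacher embedding i
    (same input); all other teacher embeddings in the batch are negatives.
    """
    total = 0.0
    for i, s in enumerate(student_embs):
        sims = [cosine(s, t) / temperature for t in teacher_embs]
        m = max(sims)
        # log-sum-exp over all teacher embeddings, computed stably
        log_denominator = m + math.log(sum(math.exp(x - m) for x in sims))
        total += log_denominator - sims[i]
    return total / len(student_embs)


# Usage: matching embeddings give a far smaller loss than mismatched ones.
student = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_distill_loss(student, [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_distill_loss(student, [[0.0, 1.0], [1.0, 0.0]])
```

In this sketch the only supervision is which teacher embedding came from the same input, which is why such a loss can be applied to large unlabeled speech collections.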
- Conference Article
1
- 10.1109/icpeca51329.2021.9362719
- Jan 22, 2021
At this stage, the popular deep neural network models often encounter problems of high latency, difficult deployment and high hardware requirements in practical applications. Knowledge distillation is a good approach to solve these problems. We adopted an innovative knowledge distillation approach and formulated data augmentation strategies for the tasks, and obtained a lightweight model with 6.7x acceleration ratio and 13.6x compression ratio compared to the baseline model BERT-base, and the average performance of the lightweight model reached 95% of BERT-base for each task. We continue to conduct in-depth research to investigate some of the issues that remain in the knowledge distillation phase. To address the problems in distillation model selection and model fine-tuning, we propose a teacher model and student model selection strategy and a two-stage model fine-tuning strategy before and after the knowledge distillation stage. These two strategies further improve the average performance of the models to 98% of BERT-base. Finally, we developed a lightweight model evaluation scheme based on different types of downstream tasks, which provides a reference for subsequent practical applications when encountering similar tasks.
- Research Article
7
- 10.1038/s41598-024-69813-6
- Aug 14, 2024
- Scientific Reports
This paper presents a Cosine Similarity-Based Knowledge Distillation (CSKD) for robust, lightweight object detectors. Knowledge Distillation (KD) has been effective in enhancing the performance of compact models in image classification by leveraging deep CNN models. However, the complex and multifaceted nature of object detection, characterized by its modular design and multitasking requirements, poses significant challenges for traditional KD techniques. These challenges are further compounded by the conventional reliance on the Mean Squared Error (MSE) loss function and the limited application of enhanced feature representations to the training phase. Addressing these limitations, the proposed CSKD method combines cosine similarity guidance with MSE loss to facilitate a more effective knowledge transfer from the teacher model to the student model. This is achieved by distilling both intermediate features and prediction outputs, aided by an assistant prediction branch designed to learn directly from the teacher’s predictions. This dual-faceted distillation strategy enables the student model to better mimic the teacher model’s behavior, leading to improved performance. The proposed method demonstrates versatility and robustness across various object detector architectures without the need for additional feature enhancement layers during training. Notably, employing ResNet-50 as the teacher model and ResNet-18 as the student model, we achieve new benchmarks in KD for object detection across several architectures, including Faster-RCNN, RetinaNet, FCOS, and GFL, with respective mAP scores of 36.6, 35.2, 35.9, and 38.9. These results highlight the effectiveness of CSKD in advancing the state-of-the-art in KD for object detection, offering a compelling solution to the challenges previously faced by traditional KD methods in this domain. The code of the proposed CSKD is available at https://github.com/swkdn16/CSKD.
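One plausible way to combine cosine similarity guidance with an MSE loss on intermediate features is sketched below in plain Python. MSE penalises differences in magnitude, while the cosine term penalises differences in direction. The abstract does not give the actual CSKD formula, so the additive form, the `cos_weight` parameter, and the function names are assumptions for illustration; the authors' repository is the reference for the real method.

```python
import math


def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cosine_similarity(a, b, eps=1e-12):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / max(norm_a * norm_b, eps)


def feature_distill_loss(student_feat, teacher_feat, cos_weight=1.0):
    """MSE plus a cosine-distance term (hypothetical combination).

    The cosine distance (1 - similarity) is 0 when the two feature
    vectors point in the same direction, regardless of their scale.
    """
    cos_distance = 1.0 - cosine_similarity(student_feat, teacher_feat)
    return mse(student_feat, teacher_feat) + cos_weight * cos_distance


# Usage: identical features cost ~0; orthogonal unit features cost 2.0.
same = feature_distill_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
orthogonal = feature_distill_loss([1.0, 0.0], [0.0, 1.0])
```

In a real detector the inputs would be flattened feature maps from matching layers of the teacher and student, with the student features projected to the teacher's channel count when the two differ.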
- Research Article
- 10.1016/j.iswa.2026.200638
- May 1, 2026
- Intelligent Systems with Applications
Leveraging knowledge distillation for lightweight and interpretable deep learning in Ethiopian medicinal plant classification
- Book Chapter
10
- 10.1016/b978-0-32-385787-1.00013-0
- Jan 1, 2022
- Deep Learning for Robot Perception and Cognition
Chapter 8 - Knowledge distillation
- Research Article
26
- 10.1109/tcbb.2023.3272333
- Jul 1, 2024
- IEEE/ACM transactions on computational biology and bioinformatics
Automated multi-label chest X-rays (CXR) image classification has achieved substantial progress in clinical diagnosis via utilizing sophisticated deep learning approaches. However, most deep models have high computational demands, which makes them less feasible for compact devices with low computational requirements. To overcome this problem, we propose a knowledge distillation (KD) strategy to create the compact deep learning model for the real-time multi-label CXR image classification. We study different alternatives of CNNs and Transforms as the teacher to distill the knowledge to a smaller student. Then, we employed explainable artificial intelligence (XAI) to provide the visual explanation for the model decision improved by the KD. Our results on three benchmark CXR datasets show that our KD strategy provides the improved performance on the compact student model, thus being the feasible choice for many limited hardware platforms. For instance, when using DenseNet161 as the teacher network, EEEA-Net-C2 achieved an AUC of 83.7%, 87.1%, and 88.7% on the ChestX-ray14, CheXpert, and PadChest datasets, respectively, with fewer parameters of 4.7 million and computational cost of 0.3 billion FLOPS.