Knowledge Distillation in Image Classification: The Impact of Datasets

Abstract

As the demand for efficient and lightweight models in image classification grows, knowledge distillation has emerged as a promising technique to transfer expertise from complex teacher models to simpler student models. However, the efficacy of knowledge distillation is intricately linked to the choice of datasets used during training. Datasets are pivotal in shaping a model’s learning process, influencing its ability to generalize and discriminate between diverse patterns. While considerable research has independently explored knowledge distillation and image classification, a comprehensive understanding of how different datasets impact knowledge distillation remains a critical gap. This study systematically investigates the impact of diverse datasets on knowledge distillation in image classification. By varying dataset characteristics such as size, domain specificity, and inherent biases, we aim to unravel the nuanced relationship between datasets and the efficacy of knowledge transfer. Our experiments employ a range of datasets to comprehensively explore their impact on the performance gains achieved through knowledge distillation. This study contributes valuable guidance for researchers and practitioners seeking to optimize image classification models through knowledge distillation. By elucidating the intricate interplay between dataset characteristics and knowledge distillation outcomes, our findings empower the community to make informed decisions when selecting datasets, ultimately advancing the field toward more robust and efficient model development.

Similar Papers
  • Research Article
  • Cite Count Icon 39
  • 10.34133/plantphenomics.0062
Knowledge Distillation Facilitates the Lightweight and Efficient Plant Diseases Detection Model.
  • Jan 1, 2023
  • Plant Phenomics
  • Qianding Huang + 7 more

Timely plant disease diagnosis can inhibit the spread of disease and prevent large-scale drops in production, which benefits food production. Object detection-based plant disease diagnosis methods have attracted widespread attention due to their accuracy in classifying and locating diseases. However, existing methods are still limited to single-crop disease diagnosis. More importantly, existing models have a large number of parameters, which hinders deployment on agricultural mobile devices. Yet reducing the number of model parameters tends to decrease model accuracy. To solve these problems, we propose a plant disease detection method based on knowledge distillation to achieve a lightweight and efficient diagnosis of multiple diseases across multiple crops. In detail, we design 2 strategies to build 4 different lightweight models as student models: the YOLOR-Light-v1, YOLOR-Light-v2, Mobile-YOLOR-v1, and Mobile-YOLOR-v2 models, and adopt the YOLOR model as the teacher model. We develop a multistage knowledge distillation method to improve lightweight model performance, achieving 60.4% mAP@.5 on the PlantDoc dataset with few model parameters, outperforming existing methods. Overall, the multistage knowledge distillation technique can make the model lighter while maintaining high accuracy. Moreover, the technique can be extended to other tasks, such as image classification and image segmentation, to obtain automated plant disease diagnostic models with a wider range of lightweight applicability in smart agriculture. Our code is available at https://github.com/QDH/MSKD.

  • Research Article
  • Cite Count Icon 8
  • 10.1109/tcss.2023.3293882
Compressing the Multiobject Tracking Model via Knowledge Distillation
  • Apr 1, 2024
  • IEEE Transactions on Computational Social Systems
  • Tianyi Liang + 5 more

Recent multiobject tracking (MOT) methods usually use very deep neural networks to achieve competitive accuracy, which inevitably results in degraded inference speed. To strike a better balance between tracking accuracy and speed, in this work, we propose to compress the MOT model via knowledge distillation (KD), enabling the more lightweight student model to obtain performance similar to that of the teacher model. Although KD has been well studied for simpler tasks such as image classification, the complexity of MOT poses new challenges because the MOT model is more sensitive to foreground information than the classification model. To deal with that, we first propose attention-guided feature distillation, which focuses the student model on the crucial region (foreground and the region with strong discrepancy against itself) of the teacher’s feature map. Moreover, we propose a foreground mask, which leverages the knowledge from the teacher model to filter out the low-quality soft labels from the background, thereby reducing their negative effects on distillation. Evaluations on several benchmarks demonstrate that the proposed KD method lets the student network achieve leading performance while running 20.0%–27.4% faster than the teacher network and reducing the parameters by 28.5%–87.1%. To the best of our knowledge, this is the first work to compress the MOT model via KD.

  • Research Article
  • Cite Count Icon 11
  • 10.70917/2025014
Lightweight Deep Learning Models For Edge Devices—A Survey
  • Jan 6, 2025
  • International Journal of Computer Information Systems and Industrial Management Applications
  • Aminu Musa + 5 more

As edge computing gains attention across various domains, the demand for lightweight deep learning models capable of running efficiently on resource-constrained edge devices has surged. This survey investigates the landscape of lightweight deep learning models tailored for edge computing environments. The survey explores various model compression techniques used to design and optimize deep learning models for edge deployment, including model quantization, pruning, and knowledge distillation. Emphasis is placed on strategies to reduce model size, computational complexity, and memory footprint while maintaining satisfactory performance levels. Additionally, the study examines the performance of these techniques on three real-life datasets for evaluating lightweight deep learning models, highlighting the importance of balanced datasets representative of edge device deployment scenarios. Furthermore, this survey provides a comprehensive overview of the current state of lightweight deep learning models for edge devices, offering insights into design considerations, optimization techniques, and performance evaluation methodologies. The findings show that most of the compression techniques suffer from performance degradation, indicating a trade-off between compression and performance. Therefore, we proposed a hybrid lossless compressed model by combining pruning, quantization, and knowledge distillation to reduce parameters and weights, resulting in a lightweight model. The proposed model is three times smaller than the vanilla CNN model and achieved a state-of-the-art accuracy of 97% after compression, which shows the effectiveness of our approach. These results will serve as a valuable resource for researchers and practitioners aiming to develop efficient and scalable deep learning solutions for edge computing applications.
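To make two of the compression techniques named above concrete, here is a minimal sketch of magnitude pruning followed by symmetric 8-bit quantization on a flat list of weights. It is an illustration only, not the survey's hybrid method: the function names, the global magnitude threshold, and the symmetric int8 scheme are all assumptions chosen for brevity.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights.

    Assumption: a single global magnitude threshold; real pipelines often
    prune per layer or per channel.
    """
    k = int(len(weights) * sparsity)  # number of weights to remove
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    pruned, removed = [], 0
    for w in weights:
        if abs(w) <= threshold and removed < k:
            pruned.append(0.0)
            removed += 1
        else:
            pruned.append(w)
    return pruned


def quantize_int8(weights):
    """Symmetric uniform quantization to integers in [-127, 127].

    Returns (quantized_values, scale); multiply by scale to dequantize.
    """
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    return [int(round(w / scale)) for w in weights], scale


def dequantize(quantized, scale):
    """Map quantized integers back to approximate float weights."""
    return [q * scale for q in quantized]


weights = [0.1, -0.5, 0.05, 2.0]
pruned = magnitude_prune(weights, sparsity=0.5)
quantized, scale = quantize_int8(pruned)
restored = dequantize(quantized, scale)
```

Pruning removes whole weights, while quantization keeps every remaining weight at reduced precision, so the rounding error per weight is bounded by half the scale. Both steps discard information, which is consistent with the compression versus performance trade-off the abstract reports.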

  • Research Article
  • Cite Count Icon 5
  • 10.70917/ijcisim-2025-0014
Lightweight Deep Learning Models For Edge Devices—A Survey
  • Jan 6, 2025
  • International Journal of Computer Information Systems and Industrial Management Applications
  • Aminu Musa + 5 more

As edge computing gains attention across various domains, the demand for lightweight deep learning models capable of running efficiently on resource-constrained edge devices has surged. This survey investigates the landscape of lightweight deep learning models tailored for edge computing environments. The survey explores various model compression techniques used to design and optimize deep learning models for edge deployment, including model quantization, pruning, and knowledge distillation. Emphasis is placed on strategies to reduce model size, computational complexity, and memory footprint while maintaining satisfactory performance levels. Additionally, the study examines the performance of these techniques on three real-life datasets for evaluating lightweight deep learning models, highlighting the importance of balanced datasets representative of edge device deployment scenarios. Furthermore, this survey provides a comprehensive overview of the current state of lightweight deep learning models for edge devices, offering insights into design considerations, optimization techniques, and performance evaluation methodologies. The findings show that most of the compression techniques suffer from performance degradation, indicating a trade-off between compression and performance. Therefore, we proposed a hybrid lossless compressed model by combining pruning, quantization, and knowledge distillation to reduce parameters and weights, resulting in a lightweight model. The proposed model is three times smaller than the vanilla CNN model and achieved a state-of-the-art accuracy of 97% after compression, which shows the effectiveness of our approach. These results will serve as a valuable resource for researchers and practitioners aiming to develop efficient and scalable deep learning solutions for edge computing applications.

  • Research Article
  • Cite Count Icon 3
  • 10.1109/embc40787.2023.10340704
Cutting Weights of Deep Learning Models for Heart Sound Classification: Introducing a Knowledge Distillation Approach.
  • Jul 24, 2023
  • Annual International Conference of the IEEE Engineering in Medicine and Biology Society
  • Zikai Song + 7 more

Cardiovascular diseases (CVDs) are the number one cause of death worldwide. In recent years, intelligent auxiliary diagnosis of CVDs based on computer audition has become a popular research field, and intelligent diagnosis technology is increasingly mature. Neural networks used to monitor CVDs are becoming more complex, requiring more computing power and memory, and are difficult to deploy in wearable devices. This paper proposes a lightweight model for classifying heart sounds based on knowledge distillation, which can be deployed in wearable devices to monitor the heart sounds of wearers. The network model is designed based on Convolutional Neural Networks (CNNs). Model performance is evaluated by extracting Mel Frequency Cepstral Coefficients (MFCCs) features from the PhysioNet/CinC Challenge 2016 dataset. The experimental results show that knowledge distillation can improve a lightweight network's accuracy, and our model performs well on the test set. In particular, when the knowledge distillation temperature is 7 and the weight α is 0.1, the accuracy is 88.5%, the recall is 83.8%, and the specificity is 93.6%. Clinical relevance: A lightweight model of heart sound classification based on knowledge distillation can be deployed on various hardware devices for timely monitoring and feedback of the physical condition of patients with CVDs, so that medical advice can be provided promptly. When the model is deployed on hospital medical instruments, the condition of severe and hospitalised patients can be fed back in a timely manner and clinical treatment advice can be provided to the clinicians.
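The temperature and weight α mentioned above are the two hyperparameters of the standard soft-target distillation loss. The sketch below shows one common formulation in plain Python, using temperature 7 and α = 0.1 as defaults. It assumes α weights the hard-label term and (1 − α) the distillation term; the abstract does not state which convention the authors use, and the function names here are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives softer outputs."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, label, temperature=7.0, alpha=0.1):
    """Weighted sum of hard-label cross-entropy and softened KL divergence.

    Assumption: alpha weights the hard-label term; some papers use the
    opposite convention.
    """
    # Hard-label cross-entropy, computed at temperature 1.
    hard = -math.log(softmax(student_logits)[label])
    # KL(teacher || student) on the temperature-softened distributions.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    # The T^2 factor keeps soft-target gradients on a scale comparable to
    # the hard-label term as the temperature changes.
    return alpha * hard + (1.0 - alpha) * temperature ** 2 * soft


matched = kd_loss([2.0, 0.0], [2.0, 0.0], label=0)
mismatched = kd_loss([0.0, 2.0], [2.0, 0.0], label=0)
```

When the student reproduces the teacher's logits, the KL term vanishes and only the scaled hard-label loss remains, so `matched` is smaller than `mismatched`.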

  • Research Article
  • Cite Count Icon 23
  • 10.1016/j.heliyon.2024.e34376
Efficient image classification through collaborative knowledge distillation: A novel AlexNet modification approach
  • Jul 1, 2024
  • Heliyon
  • Avazov Kuldashboy + 5 more


  • Research Article
  • Cite Count Icon 19
  • 10.3390/sym17071002
Symmetrical Learning and Transferring: Efficient Knowledge Distillation for Remote Sensing Image Classification
  • Jun 25, 2025
  • Symmetry
  • Huaxiang Song + 9 more

Knowledge distillation (KD) is crucial for remote sensing image (RSI) classification, particularly as the operating environment in remote sensing is often constrained by hardware limitations. However, prior research has not fully addressed the challenge of leveraging KD to develop lightweight, high-accuracy models for RSI classification. A key issue is the sparse distribution of training data, which often results in asymmetry within the data. This asymmetry impedes the transfer of prior knowledge during the distillation process, diminishing the overall efficacy of KD techniques. To overcome this challenge, we propose a novel, symmetry-enhanced approach that augments the logit-based KD process, improving its effectiveness and efficiency for RSI classification. Our method is distinguished by three core innovations: a symmetrically generative algorithm to enhance the symmetry of the training data, an efficient algorithm for constructing a robust teacher ensemble model, and a quantitative technique for feature alignment. Rigorous evaluations on three benchmark datasets demonstrate that our method outperforms 14 existing KD-based approaches and 30 other state-of-the-art methods. Specifically, the student model trained with our approach achieves accuracy improvements of up to 22.5% while reducing the model size and inference time by as much as 96% and 88%, respectively. In conclusion, this research makes a significant contribution to RSI classification by introducing an efficient and effective data symmetry-driven method to enhance the knowledge transferring efficiency of the logit-based KD process.

  • Conference Article
  • Cite Count Icon 7
  • 10.1109/icaice54393.2021.00127
Classification of Histopathologic Images of Breast Cancer by Multi-teacher Small-sample Knowledge Distillation
  • Nov 1, 2021
  • Leiqi Wang + 1 more

Model fusion can effectively improve prediction quality, but it increases computation time. In this paper, dual-stage progressive knowledge distillation is improved by combining it with multi-teacher knowledge distillation. A simple and effective method for integrating the soft targets of multiple teachers is proposed for multi-teacher knowledge distillation, which strengthens the guiding role of the better-performing models. Dual-stage progressive knowledge distillation is a method for small-sample knowledge distillation: a progressive network grafting method is used to realize knowledge distillation in a small-sample environment. In the first step, the student blocks are grafted one by one onto the teacher network and intertwined with other teacher blocks for training, and the training process only updates the parameters of the grafted blocks. In the second step, the trained student blocks are grafted onto the teacher network in turn, so that the learned student blocks adapt to each other and finally replace the teacher network to obtain a lighter network structure. Using the soft targets obtained by this method in dual-stage progressive knowledge distillation, instead of hard-target training, yielded excellent results on the BreakHis dataset.
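One simple way to integrate soft targets from several teachers is a weighted average of their temperature-softened probabilities, with larger weights for stronger teachers. The sketch below illustrates that idea only; the abstract does not specify the paper's integration rule, so the weighting scheme, the default temperature, and the function names are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def combine_soft_targets(teacher_logits_list, weights=None, temperature=4.0):
    """Weighted average of the teachers' softened class probabilities.

    `weights` is a hypothetical per-teacher importance (for example, each
    teacher's validation accuracy); uniform weights are used when omitted.
    """
    if weights is None:
        weights = [1.0] * len(teacher_logits_list)
    total = sum(weights)
    if total <= 0:
        raise ValueError("teacher weights must sum to a positive value")
    probs = [softmax(logits, temperature) for logits in teacher_logits_list]
    num_classes = len(probs[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probs)) / total
        for c in range(num_classes)
    ]


# Two teachers that disagree; the second is trusted three times as much.
soft_target = combine_soft_targets([[2.0, 0.0], [0.0, 2.0]], weights=[1.0, 3.0])
```

Because each teacher's output is a probability distribution and the weights are normalized, the combined soft target is also a valid distribution that the student can be trained against.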

  • Research Article
  • Cite Count Icon 1
  • 10.59247/jahir.v2i2.289
Comparison of Transfer Learning Performance in Lung and Colon Classification with Knowledge Distillation
  • Aug 31, 2024
  • Journal of Advanced Health Informatics Research
  • Annastasya Nabila Elsa Wulandari + 3 more

This research aims to apply the knowledge distillation method to medical image classification, specifically in the case of lung and colon image classification using various transfer learning models. Knowledge distillation allows the transfer of knowledge from a larger model (teacher) to a smaller model (student), which enables more efficient model building without sacrificing accuracy. In this research, the DenseNet169 model is used as the teacher model. The student model uses several alternative transfer learning architectures such as DenseNet121, MobileNet, ResNet50, InceptionV3, and Xception. The data used consists of 25,000 histopathology images that have been processed and divided into training, validation, and test data. Data augmentation was performed to enlarge the dataset from 750 to 25,000 images, which helped improve the performance of the model. Model performance evaluation was performed by measuring the accuracy and loss value of each student model compared to the teacher model. The results showed that the student models generated through the knowledge distillation process performed close to or even exceeded the teacher model in some cases, with the Xception model showing the highest accuracy of 96.95%. In conclusion, knowledge distillation is effective in reducing model complexity without compromising performance, which is particularly beneficial for implementation on resource-constrained devices.

  • Conference Article
  • Cite Count Icon 8
  • 10.1109/iscslp57327.2022.10038276
Label-free Knowledge Distillation with Contrastive Loss for Light-weight Speaker Recognition
  • Dec 11, 2022
  • Zhiyuan Peng + 4 more

Very deep models for speaker recognition (SR) have demonstrated remarkable performance improvement in recent research. However, it is impractical to deploy these models for on-device applications with constrained computational resources. On the other hand, light-weight models are highly desired in practice despite their sub-optimal performance. This research aims to improve light-weight SR models through large-scale label-free knowledge distillation (KD). Existing KD approaches for SR typically require speaker labels to learn task-specific knowledge, due to the inefficiency of conventional loss for distillation. To address the inefficiency problem and achieve label-free KD, we propose to employ the contrastive loss from self-supervised learning for distillation. Extensive experiments are conducted on a collection of public speech datasets from diverse sources. Results on light-weight SR models show that the proposed approach of label-free KD with contrastive loss consistently outperforms both conventional distillation methods and self-supervised learning methods by a significant margin.
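A contrastive distillation objective needs no speaker labels because the supervision comes from pairing: the student's embedding of an utterance should be closer to the teacher's embedding of the same utterance than to the teacher's embeddings of other utterances in the batch. The sketch below uses an InfoNCE-style loss to illustrate this. It is an assumed formulation, not necessarily the loss used in the paper, and the temperature value and function names are hypothetical.

```python
import math


def normalize(vector):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def contrastive_kd_loss(student_embs, teacher_embs, temperature=0.1):
    """InfoNCE-style loss over a batch of paired embeddings.

    Row i of each list comes from the same utterance (the positive pair);
    all other teacher rows in the batch act as negatives.
    """
    students = [normalize(v) for v in student_embs]
    teachers = [normalize(v) for v in teacher_embs]
    losses = []
    for i, s in enumerate(students):
        # Cosine similarity to every teacher embedding, scaled by temperature.
        logits = [sum(a * b for a, b in zip(s, t)) / temperature for t in teachers]
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denominator - logits[i])  # -log softmax of the positive
    return sum(losses) / len(losses)


aligned = contrastive_kd_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_kd_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when each student embedding matches its own teacher embedding and large when the pairing is wrong, which is what drives the student toward the teacher's representation without any class labels.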

  • Conference Article
  • Cite Count Icon 1
  • 10.1109/icpeca51329.2021.9362719
Knowledge distillation application technology for Chinese NLP
  • Jan 22, 2021
  • Hanwen Luo + 4 more

At this stage, popular deep neural network models often encounter problems of high latency, difficult deployment, and high hardware requirements in practical applications. Knowledge distillation is a good approach to solve these problems. We adopted an innovative knowledge distillation approach and formulated data augmentation strategies for the tasks, and obtained a lightweight model with a 6.7x acceleration ratio and a 13.6x compression ratio compared to the baseline model BERT-base, and the average performance of the lightweight model reached 95% of BERT-base for each task. We continue to conduct in-depth research to investigate some of the issues that remain in the knowledge distillation phase. To address the problems in distillation model selection and model fine-tuning, we propose a teacher model and student model selection strategy and a two-stage model fine-tuning strategy before and after the knowledge distillation stage. These two strategies further improve the average performance of the models to 98% of BERT-base. Finally, we developed a lightweight model evaluation scheme based on different types of downstream tasks, which provides a reference for subsequent practical applications when encountering similar tasks.

  • Research Article
  • Cite Count Icon 7
  • 10.1038/s41598-024-69813-6
Cosine similarity-guided knowledge distillation for robust object detectors
  • Aug 14, 2024
  • Scientific Reports
  • Sangwoo Park + 2 more

This paper presents a Cosine Similarity-Based Knowledge Distillation (CSKD) for robust, lightweight object detectors. Knowledge Distillation (KD) has been effective in enhancing the performance of compact models in image classification by leveraging deep CNN models. However, the complex and multifaceted nature of object detection, characterized by its modular design and multitasking requirements, poses significant challenges for traditional KD techniques. These challenges are further compounded by the conventional reliance on the Mean Squared Error (MSE) loss function and the limited application of enhanced feature representations to the training phase. Addressing these limitations, the proposed CSKD method combines cosine similarity guidance with MSE loss to facilitate a more effective knowledge transfer from the teacher model to the student model. This is achieved by distilling both intermediate features and prediction outputs, aided by an assistant prediction branch designed to learn directly from the teacher’s predictions. This dual-faceted distillation strategy enables the student model to better mimic the teacher model’s behavior, leading to improved performance. The proposed method demonstrates versatility and robustness across various object detector architectures without the need for additional feature enhancement layers during training. Notably, employing ResNet-50 as the teacher model and ResNet-18 as the student model, we achieve new benchmarks in KD for object detection across several architectures, including Faster-RCNN, RetinaNet, FCOS, and GFL, with respective mAP scores of 36.6, 35.2, 35.9, and 38.9. These results highlight the effectiveness of CSKD in advancing the state-of-the-art in KD for object detection, offering a compelling solution to the challenges previously faced by traditional KD methods in this domain. The code of the proposed CSKD is available at https://github.com/swkdn16/CSKD.
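The idea of pairing cosine similarity with MSE for feature distillation can be sketched on flattened feature vectors as follows. This is a generic illustration and not the CSKD implementation, which is available at the repository linked above; the additive combination, the `cos_weight` parameter, and the function names are assumptions.

```python
import math


def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cosine_similarity(a, b, eps=1e-12):
    """Cosine of the angle between two vectors; eps guards against zero norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / max(norm_a * norm_b, eps)


def feature_distill_loss(student_feat, teacher_feat, cos_weight=1.0):
    """MSE term plus a (1 - cosine similarity) term.

    MSE penalizes differences in magnitude and direction, while the cosine
    term penalizes direction only. `cos_weight` is a hypothetical balance
    factor between the two.
    """
    direction_gap = 1.0 - cosine_similarity(student_feat, teacher_feat)
    return mse(student_feat, teacher_feat) + cos_weight * direction_gap


identical = feature_distill_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
orthogonal = feature_distill_loss([1.0, 0.0], [0.0, 1.0])
```

A student feature that points in the teacher's direction but has a different scale incurs only the MSE penalty, whereas an orthogonal feature is penalized by both terms.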

  • Research Article
  • 10.1016/j.iswa.2026.200638
Leveraging knowledge distillation for lightweight and interpretable deep learning in Ethiopian medicinal plant classification
  • May 1, 2026
  • Intelligent Systems with Applications
  • Mulugeta Adibaru Kiflie


  • Book Chapter
  • Cite Count Icon 10
  • 10.1016/b978-0-32-385787-1.00013-0
Chapter 8 - Knowledge distillation
  • Jan 1, 2022
  • Deep Learning for Robot Perception and Cognition
  • Nikolaos Passalis + 2 more


  • Research Article
  • Cite Count Icon 26
  • 10.1109/tcbb.2023.3272333
Explainable Knowledge Distillation for On-Device Chest X-Ray Classification.
  • Jul 1, 2024
  • IEEE/ACM transactions on computational biology and bioinformatics
  • Chakkrit Termritthikun + 4 more

Automated multi-label chest X-rays (CXR) image classification has achieved substantial progress in clinical diagnosis via utilizing sophisticated deep learning approaches. However, most deep models have high computational demands, which makes them less feasible for compact devices with low computational requirements. To overcome this problem, we propose a knowledge distillation (KD) strategy to create the compact deep learning model for the real-time multi-label CXR image classification. We study different alternatives of CNNs and Transforms as the teacher to distill the knowledge to a smaller student. Then, we employed explainable artificial intelligence (XAI) to provide the visual explanation for the model decision improved by the KD. Our results on three benchmark CXR datasets show that our KD strategy provides the improved performance on the compact student model, thus being the feasible choice for many limited hardware platforms. For instance, when using DenseNet161 as the teacher network, EEEA-Net-C2 achieved an AUC of 83.7%, 87.1%, and 88.7% on the ChestX-ray14, CheXpert, and PadChest datasets, respectively, with fewer parameters of 4.7 million and computational cost of 0.3 billion FLOPS.
