Efficient image classification through collaborative knowledge distillation: A novel AlexNet modification approach

Similar Papers
  • Research Article
  • Cite Count Icon 12
  • 10.1016/j.neucom.2024.127516
Multi-perspective analysis on data augmentation in knowledge distillation
  • Mar 5, 2024
  • Neurocomputing
  • Wei Li + 3 more

  • Research Article
  • Cite Count Icon 5
  • 10.1016/j.asoc.2024.111579
PURF: Improving teacher representations by imposing smoothness constraints for knowledge distillation
  • Apr 9, 2024
  • Applied Soft Computing
  • Md Imtiaz Hossain + 3 more

  • Conference Article
  • 10.2118/228194-ms
Stop Using Convolutional Neural Networks: Knowledge Distillation for an Interpretable and Lightweight Decision Tree in Rod Pump Working Condition Diagnosis
  • Oct 13, 2025
  • Qiangqiang Mao + 3 more

Recent studies on rod pump working condition diagnosis have heavily relied on convolutional neural networks (CNN) for dynamometer card classification, largely due to CNNs’ success in extensive image recognition tasks. However, these models often depend on over-parameterized architectures with millions or even billions of parameters, making them unsuitable for deployment on edge devices or embedded systems with limited computational resources and energy budgets. Their large size also results in slow prediction speed, hindering real-time fault detection in field operations. Furthermore, the black-box nature of CNNs compromises interpretability, making it difficult for field operators to understand and trust the predictions. Given these limitations, decision trees present an appealing alternative due to their lightweight structure, inherent interpretability, and ease of use. Yet traditional decision trees, such as CART, suffer from a significant accuracy gap compared to CNN, particularly in image classification tasks without effective feature extraction, and thus are rarely used. This work addresses the issue by first significantly improving the accuracy of traditional decision trees through a novel tree reformulation and gradient-based entire tree optimization, avoiding the suboptimal tree model induced by the traditional greedy optimization. Built upon our optimized decision tree, we subsequently leverage knowledge distillation to further boost its image classification accuracy, allowing the tree to inherit the knowledge or learning capability of a well-trained CNN model. Experiments on two representative dynamometer card datasets demonstrate the effectiveness of our approach. Our decision tree achieves 5.32% higher accuracy than traditional CART, and knowledge distillation brings an additional 4% improvement on average, reaching accuracy comparable to CNN-based models such as ResNet-50 and Vision Transformer (ViT). 
Moreover, our tree model uses only 64 IF-THEN rules to connect input images with final predictions in contrast to 4,096 rules in CART, thus ensuring interpretability and transparency. Our tree model contains only 49,457 parameters, significantly fewer than 24 million in ResNet-50 and 86 million in ViT, and achieves at least a 13,000 times speedup in prediction time over ResNet and 33,000 times over ViT. This optimal balance of competitive accuracy, lightweight structure and interpretability, combined with our open-source code, makes our method a practical and reliable alternative to heavy black-box CNN classifiers for real-time deployment in dynamometer cards as well as other image classification tasks.

  • Research Article
  • Cite Count Icon 32
  • 10.1109/tip.2021.3101158
Resolution-Aware Knowledge Distillation for Efficient Inference.
  • Jan 1, 2021
  • IEEE Transactions on Image Processing
  • Zhanxiang Feng + 2 more

Minimizing the computation complexity is essential for the popularization of deep networks in practical applications. Nowadays, most research attempts to accelerate deep networks by designing new network structures or compressing the network parameters. Meanwhile, transfer learning techniques such as knowledge distillation are utilized to preserve the performance of deep models. In this paper, we focus on accelerating deep models and relieving the computation burden by using low-resolution (LR) images as inputs while maintaining competitive performance, which is rarely researched in the current literature. Deep networks may encounter serious performance degradation when using LR inputs because many details are unavailable from LR images. Besides, the existing approaches may fail to learn discriminative features for LR images because of the dramatic appearance variations between LR and high-resolution (HR) images. To tackle the above problems, we propose a resolution-aware knowledge distillation (RKD) framework to narrow the cross-resolution variations by transferring knowledge from the HR domain to the LR domain. The proposed framework consists of an HR teacher network and an LR student network. First, we introduce a discriminator and propose an adversarial learning strategy to shrink the variations between inputs with changing resolution. Then we design a cross-resolution knowledge distillation (CRKD) loss to train a discriminative student network by exploiting the knowledge of the teacher network. The CRKD loss consists of a resolution-aware distillation loss, a pair-wise constraint, and a maximum mean discrepancy loss. Experimental results on person re-identification, image classification, face recognition, and defect segmentation tasks demonstrate that RKD outperforms traditional knowledge distillation methods by achieving better performance with lower computation complexity.
Furthermore, CRKD surpasses the state-of-the-art knowledge distillation methods in transferring knowledge across different resolutions under the RKD framework, especially when coping with large resolution differences.
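
The abstract above lists a maximum mean discrepancy (MMD) loss as one term of the CRKD loss but does not give its form. As a rough illustration only, the following is a minimal sketch of the standard biased MMD² estimate with a Gaussian (RBF) kernel; the kernel choice, the `sigma` bandwidth, and the function names are assumptions and are not taken from the paper.

```python
import math


def rbf_kernel(a, b, sigma=1.0):
    """Gaussian kernel between two equal-length feature vectors."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def mmd_squared(xs, ys, sigma=1.0):
    """Biased estimate of squared MMD between two sets of feature vectors.

    xs could hold features of HR inputs and ys features of LR inputs;
    that pairing is an assumption made for illustration.
    """
    def mean_kernel(ps, qs):
        total = sum(rbf_kernel(p, q, sigma) for p in ps for q in qs)
        return total / (len(ps) * len(qs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Hypothetical toy features: two well-separated clusters.
hr_features = [[0.0, 0.0], [1.0, 0.0]]
lr_features = [[5.0, 5.0], [6.0, 5.0]]
gap = mmd_squared(hr_features, lr_features)
```

The estimate is zero when both sets are identical and grows as the two feature distributions move apart, which is why a term of this kind can be minimized to pull LR features toward HR features.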

  • Book Chapter
  • Cite Count Icon 18
  • 10.1007/978-3-031-20077-9_19
HEAD: HEtero-Assists Distillation for Heterogeneous Object Detectors
  • Jan 1, 2022
  • Luting Wang + 7 more

Conventional knowledge distillation (KD) methods for object detection mainly concentrate on homogeneous teacher-student detectors. However, the design of a lightweight detector for deployment is often significantly different from a high-capacity detector. Thus, we investigate KD among heterogeneous teacher-student pairs for wider applicability. We observe that the core difficulty for heterogeneous KD (hetero-KD) is the significant semantic gap between the backbone features of heterogeneous detectors due to the different optimization manners. Conventional homogeneous KD (homo-KD) methods suffer from such a gap and struggle to directly obtain satisfactory performance for hetero-KD. In this paper, we propose the HEtero-Assists Distillation (HEAD) framework, leveraging heterogeneous detection heads as assistants to guide the optimization of the student detector to reduce this gap. In HEAD, the assistant is an additional detection head, with an architecture homogeneous to the teacher head, attached to the student backbone. Thus, a hetero-KD is transformed into a homo-KD, allowing efficient knowledge transfer from the teacher to the student. Moreover, we extend HEAD into a Teacher-Free HEAD (TF-HEAD) framework when a well-trained teacher detector is unavailable. Our method has achieved significant improvement compared to current detection KD methods. For example, on the MS-COCO dataset, TF-HEAD helps R18 RetinaNet achieve 33.9 mAP (+2.2), while HEAD further pushes the limit to 36.2 mAP (+4.5).

  • Research Article
  • Cite Count Icon 6
  • 10.1016/j.dsp.2024.104512
Discretization and decoupled knowledge distillation for arbitrary oriented object detection
  • Apr 17, 2024
  • Digital Signal Processing
  • Cheng Chen + 2 more

  • Research Article
  • Cite Count Icon 57
  • 10.1109/tip.2022.3212905
A General Dynamic Knowledge Distillation Method for Visual Analytics.
  • Jan 1, 2022
  • IEEE Transactions on Image Processing
  • Zhigang Tu + 2 more

Existing knowledge distillation (KD) methods normally fix the weights of the teacher network and use the knowledge from the teacher network to guide the training of the student network non-interactively; this is therefore called static knowledge distillation (SKD). SKD is widely used in model compression on homologous data and knowledge transfer on heterogeneous data. However, a teacher network with fixed weights constrains what the student network can learn from it. It is reasonable to expect that the teacher network itself can be continuously optimized to promote the learning ability of the student network dynamically. To overcome this limitation, we propose a novel dynamic knowledge distillation (DKD) method, in which the teacher network and the student network can learn from each other interactively. Importantly, we analyzed the effectiveness of DKD mathematically (see Eq. 4), and addressed one crucial issue caused by the continuous change of the teacher network in the dynamic distillation process by designing a valid loss function. We verified the practicality of our DKD by extensive experiments on various visual tasks, e.g. for model compression, we conducted experiments on image classification and object detection. For knowledge transfer, video-based human action recognition is chosen for analysis. The experimental results on benchmark datasets (i.e. ILSVRC2012, COCO2017, HMDB51, UCF101) demonstrated that the proposed DKD improves the performance of these visual tasks by a large margin. The source code is publicly available online at1.

  • Research Article
  • Cite Count Icon 1
  • 10.59247/jahir.v2i2.289
Comparison of Transfer Learning Performance in Lung and Colon Classification with Knowledge Distillation
  • Aug 31, 2024
  • Journal of Advanced Health Informatics Research
  • Annastasya Nabila Elsa Wulandari + 3 more

This research aims to apply the knowledge distillation method to medical image classification, specifically in the case of lung and colon image classification using various transfer learning models. Knowledge distillation allows the transfer of knowledge from a larger model (teacher) to a smaller model (student), which enables more efficient model building without sacrificing accuracy. In this research, the DenseNet169 model is used as the teacher model. The student model uses several alternative transfer learning architectures such as DenseNet121, MobileNet, ResNet50, InceptionV3, and Xception. The data used consists of 25,000 histopathology images that have been processed and divided into training, validation, and test data. Data augmentation was performed to enlarge the dataset from 750 to 25,000 images, which helped improve the performance of the model. Model performance evaluation was performed by measuring the accuracy and loss value of each student model compared to the teacher model. The results showed that the student models generated through the knowledge distillation process performed close to or even exceeded the teacher model in some cases, with the Xception model showing the highest accuracy of 96.95%. In conclusion, knowledge distillation is effective in reducing model complexity without compromising performance, which is particularly beneficial for implementation on resource-constrained devices.

  • Research Article
  • Cite Count Icon 18
  • 10.3390/electronics13224530
Simplified Knowledge Distillation for Deep Neural Networks Bridging the Performance Gap with a Novel Teacher–Student Architecture
  • Nov 18, 2024
  • Electronics
  • Sabina Umirzakova + 4 more

The rapid evolution of deep learning has led to significant achievements in computer vision, primarily driven by complex convolutional neural networks (CNNs). However, the increasing depth and parameter count of these networks often result in overfitting and elevated computational demands. Knowledge distillation (KD) has emerged as a promising technique to address these issues by transferring knowledge from a large, well-trained teacher model to a more compact student model. This paper introduces a novel knowledge distillation method that simplifies the distillation process and narrows the performance gap between teacher and student models without relying on intricate knowledge representations. Our approach leverages a unique teacher network architecture designed to enhance the efficiency and effectiveness of knowledge transfer. Additionally, we introduce a streamlined teacher network architecture that transfers knowledge effectively through a simplified distillation process, enabling the student model to achieve high accuracy with reduced computational demands. Comprehensive experiments conducted on the CIFAR-10 dataset demonstrate that our proposed model achieves superior performance compared to traditional KD methods and established architectures such as ResNet and VGG networks. The proposed method not only maintains high accuracy but also significantly reduces training and validation losses. Key findings highlight the optimal hyperparameter settings (temperature T = 15.0 and smoothing factor α = 0.7), which yield the highest validation accuracy and lowest loss values. This research contributes to the theoretical and practical advancements in knowledge distillation, providing a robust framework for future applications and research in neural network compression and optimization. The simplicity and efficiency of our approach pave the way for more accessible and scalable solutions in deep learning model deployment.
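
The abstract above reports a temperature T = 15.0 and a factor α = 0.7 but does not spell out the loss they enter. As a hedged illustration, here is a minimal sketch of the classic temperature-scaled distillation loss (soft-target KL divergence blended with hard-label cross-entropy). It is not the paper's own method: treating α as the weight on the soft-target term, the T² scaling, and all function names are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, label, temperature=15.0, alpha=0.7):
    """Classic KD loss for one example (assumed form, see lead-in).

    alpha weights the soft-target KL term; (1 - alpha) weights the
    ordinary cross-entropy against the ground-truth label.
    """
    # Hard-label cross-entropy, computed at temperature 1.
    cross_entropy = -math.log(softmax(student_logits)[label])

    # KL(teacher || student) on temperature-softened distributions.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)

    # T^2 keeps the soft-term gradients on a scale comparable to the hard term.
    return alpha * temperature ** 2 * kl + (1.0 - alpha) * cross_entropy


# Hypothetical logits for a 3-class problem; the true class is index 2.
loss_matched = kd_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], label=2)
loss_mismatched = kd_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0], label=2)
```

When the student reproduces the teacher's logits exactly, the KL term vanishes and only the weighted cross-entropy remains; a high temperature such as 15 flattens both distributions, so the student is also pushed to match the teacher's relative scores on the non-target classes.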

  • Research Article
  • Cite Count Icon 8
  • 10.3390/computers13080184
Knowledge Distillation in Image Classification: The Impact of Datasets
  • Jul 24, 2024
  • Computers
  • Ange Gabriel Belinga + 3 more

As the demand for efficient and lightweight models in image classification grows, knowledge distillation has emerged as a promising technique to transfer expertise from complex teacher models to simpler student models. However, the efficacy of knowledge distillation is intricately linked to the choice of datasets used during training. Datasets are pivotal in shaping a model’s learning process, influencing its ability to generalize and discriminate between diverse patterns. While considerable research has independently explored knowledge distillation and image classification, a comprehensive understanding of how different datasets impact knowledge distillation remains a critical gap. This study systematically investigates the impact of diverse datasets on knowledge distillation in image classification. By varying dataset characteristics such as size, domain specificity, and inherent biases, we aim to unravel the nuanced relationship between datasets and the efficacy of knowledge transfer. Our experiments employ a range of datasets to comprehensively explore their impact on the performance gains achieved through knowledge distillation. This study contributes valuable guidance for researchers and practitioners seeking to optimize image classification models through knowledge distillation. By elucidating the intricate interplay between dataset characteristics and knowledge distillation outcomes, our findings empower the community to make informed decisions when selecting datasets, ultimately advancing the field toward more robust and efficient model development.

  • Conference Article
  • Cite Count Icon 5
  • 10.1145/3589334.3645440
Bit-mask Robust Contrastive Knowledge Distillation for Unsupervised Semantic Hashing
  • May 13, 2024
  • Liyang He + 6 more

Unsupervised semantic hashing has emerged as an indispensable technique for fast image search, which aims to convert images into binary hash codes without relying on labels. Recent advancements in the field demonstrate that employing large-scale backbones (e.g., ViT) in unsupervised semantic hashing models can yield substantial improvements. However, the inference delay has become increasingly difficult to overlook. Knowledge distillation provides a means for practical model compression to alleviate this delay. Nevertheless, the prevailing knowledge distillation approaches are not explicitly designed for semantic hashing. They ignore the unique search paradigm of semantic hashing, the inherent necessities of the distillation process, and the property of hash codes. In this paper, we propose an innovative Bit-mask Robust Contrastive knowledge Distillation (BRCD) method, specifically devised for the distillation of semantic hashing models. To ensure the effectiveness of two kinds of search paradigms in the context of semantic hashing, BRCD first aligns the semantic spaces between the teacher and student models through a contrastive knowledge distillation objective. Additionally, to eliminate noisy augmentations and ensure robust optimization, a cluster-based method within the knowledge distillation process is introduced. Furthermore, through a bit-level analysis, we uncover the presence of redundancy bits resulting from the bit independence property. To mitigate these effects, we introduce a bit mask mechanism in our knowledge distillation objective. Finally, extensive experiments not only showcase the noteworthy performance of our BRCD method in comparison to other knowledge distillation methods but also substantiate the generality of our methods across diverse semantic hashing models and backbones. The code for BRCD is available at https://github.com/hly1998/BRCD.

  • Research Article
  • Cite Count Icon 1
  • 10.17586/2226-1494-2025-25-4-737-743
Optimizing knowledge distillation models for language models
  • Aug 29, 2025
  • Scientific and Technical Journal of Information Technologies, Mechanics and Optics
  • T M Tatarnikova + 1 more

The problem of optimizing large neural networks is discussed using the example of language models. The size of large language models is an obstacle to their practical application when computing resources and memory are limited. One actively developed approach to compressing large neural network models is knowledge distillation, the transfer of knowledge from a large teacher model to a smaller student model without significant loss of result accuracy. Currently known methods of distilling knowledge have certain disadvantages: inaccurate knowledge transfer, a long learning process, and accumulation of errors in long sequences. Methods that improve the quality of knowledge distillation for language models are proposed: selective teacher intervention in the student’s learning process and low-rank adaptation. The first approach is based on the transfer of teacher tokens when teaching a student to neural network layers, for which an exponentially decreasing threshold of measuring the discrepancy between the probability distributions of the teacher and the student is reached. The second approach suggests reducing the number of parameters in a neural network by replacing fully connected layers with low-rank ones, which reduces the risk of overfitting and speeds up the learning process. The limitations of each method when working with long sequences are shown. It is proposed to combine methods to obtain an improved model of classical distillation of knowledge for long sequences. The use of a combined approach to distilling knowledge on long sequences made it possible to significantly compress the resulting model with a slight loss of quality as well as significantly reduce GPU memory consumption and response output time.
Complementary approaches to optimizing the knowledge transfer process and model compression showed better results than selective teacher intervention in the student learning process and low-rank adaptation separately. Thus, the quality of answers of the improved classical knowledge distillation model on long sequences showed 97 % of the quality of full fine-tuning and 98 % of the quality of the low-rank adaptation method in terms of ROUGE-L and Perplexity, given that the number of trainable parameters is reduced by 99 % compared to full fine-tuning and by 49 % compared to low-rank adaptation. In addition, GPU memory usage is reduced by 75 % and 30 %, respectively, and inference time by 30 %. The proposed combination of knowledge distillation methods can find application in problems with limited computational resources.
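
The abstract above describes replacing fully connected layers with low-rank ones to cut the parameter count. The arithmetic behind that saving can be sketched as follows, assuming a dense weight matrix of shape `d_in × d_out` is factored into two matrices of shapes `d_in × rank` and `rank × d_out`. The layer sizes and rank below are hypothetical and do not reproduce the percentages reported in the paper.

```python
def dense_params(d_in, d_out, bias=True):
    """Parameter count of an ordinary fully connected layer."""
    return d_in * d_out + (d_out if bias else 0)


def low_rank_params(d_in, d_out, rank, bias=True):
    """Parameter count when the weight is factored as (d_in x rank)(rank x d_out)."""
    return d_in * rank + rank * d_out + (d_out if bias else 0)


def reduction_ratio(d_in, d_out, rank, bias=True):
    """Fraction of parameters removed by the low-rank factorization."""
    full = dense_params(d_in, d_out, bias)
    factored = low_rank_params(d_in, d_out, rank, bias)
    return 1.0 - factored / full


# Hypothetical sizes: a 4096 x 4096 layer factored at rank 8.
full_count = dense_params(4096, 4096)
factored_count = low_rank_params(4096, 4096, rank=8)
saved = reduction_ratio(4096, 4096, rank=8)
```

The factorization only pays off when `rank` is well below `d_in * d_out / (d_in + d_out)`; at or above that value the two factors hold at least as many weights as the original matrix.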

  • Research Article
  • Cite Count Icon 8
  • 10.1016/j.csl.2023.101583
Dual Knowledge Distillation for neural machine translation
  • Nov 9, 2023
  • Computer Speech & Language
  • Yuxian Wan + 4 more

  • Research Article
  • Cite Count Icon 34
  • 10.1049/csy2.12120
Efficient knowledge distillation for hybrid models: A vision transformer‐convolutional neural network to convolutional neural network approach for classifying remote sensing images
  • Jul 10, 2024
  • IET Cyber-Systems and Robotics
  • Huaxiang Song + 4 more

In various fields, knowledge distillation (KD) techniques that combine vision transformers (ViTs) and convolutional neural networks (CNNs) as a hybrid teacher have shown remarkable results in classification. However, in the realm of remote sensing images (RSIs), existing KD research studies are not only scarce but also lack competitiveness. This issue significantly impedes the deployment of the notable advantages of ViTs and CNNs. To tackle this, the authors introduce a novel hybrid‐model KD approach named HMKD‐Net, which comprises a CNN‐ViT ensemble teacher and a CNN student. Contrary to popular opinion, the authors posit that the sparsity in RSI data distribution limits the effectiveness and efficiency of hybrid‐model knowledge transfer. As a solution, a simple yet innovative method to handle variances during the KD phase is suggested, leading to substantial enhancements in the effectiveness and efficiency of hybrid knowledge transfer. The authors assessed the performance of HMKD‐Net on three RSI datasets. The findings indicate that HMKD‐Net significantly outperforms other cutting‐edge methods while maintaining a significantly smaller size. Specifically, HMKD‐Net exceeds other KD‐based methods with a maximum accuracy improvement of 22.8% across various datasets. As ablation experiments indicated, HMKD‐Net has cut down on time expenses by about 80% in the KD process. This research study validates that the hybrid‐model KD technique can be more effective and efficient if the data distribution sparsity in RSIs is well handled.

  • Research Article
  • 10.1016/j.neunet.2025.108229
Gradient-aware knowledge distillation: Tackling gradient insensitivity through teacher guided gradient scaling.
  • Mar 1, 2026
  • Neural networks : the official journal of the International Neural Network Society
  • Nianwen Si + 5 more
