Counterclockwise block-by-block knowledge distillation for neural network compression
Model compression is a technique for transforming large neural network models into smaller ones. Knowledge distillation (KD) is a crucial model compression technique that involves transferring knowledge from a large teacher model to a lightweight student model. Existing knowledge distillation methods typically facilitate the knowledge transfer from teacher to student models in one or two stages. This paper introduces a novel approach called counterclockwise block-wise knowledge distillation (CBKD) to optimize the knowledge distillation process. The core idea of CBKD is to mitigate the generation gap between teacher and student models, facilitating the transmission of intermediate-layer knowledge from the teacher model. It divides both teacher and student models into multiple sub-network blocks, and in each stage of knowledge distillation, only the knowledge from one teacher sub-block is transferred to the student sub-block at the corresponding position. Additionally, in the CBKD process, deeper teacher sub-network blocks are assigned higher compression rates. Extensive experiments on Tiny-ImageNet-200 and CIFAR-10 demonstrate that the proposed CBKD method can enhance the distillation performance of various mainstream knowledge distillation approaches.
- Research Article
1
- 10.17586/2226-1494-2025-25-4-737-743
- Aug 29, 2025
- Scientific and Technical Journal of Information Technologies, Mechanics and Optics
The problem of optimizing large neural networks is discussed using language models as an example. The size of large language models is an obstacle to their practical use when computing resources and memory are limited. One actively developed approach to compressing large neural network models is knowledge distillation: the transfer of knowledge from a large teacher model to a smaller student model without significant loss of accuracy. Currently known knowledge distillation methods have certain disadvantages: inaccurate knowledge transfer, a long training process, and accumulation of errors in long sequences. Two methods that improve the quality of knowledge distillation for language models are proposed: selective teacher intervention in the student’s learning process and low-rank adaptation. The first approach passes teacher tokens to the student during training for those neural network layers at which the measured discrepancy between the probability distributions of the teacher and the student reaches an exponentially decreasing threshold. The second approach reduces the number of parameters in a neural network by replacing fully connected layers with low-rank ones, which reduces the risk of overfitting and speeds up training. The limitations of each method when working with long sequences are shown. It is proposed to combine the methods to obtain an improved version of classical knowledge distillation for long sequences. Using the combined approach on long sequences made it possible to significantly compress the resulting model with a slight loss of quality, as well as to significantly reduce GPU memory consumption and response time. 
Combining the complementary approaches to optimizing knowledge transfer and model compression gave better results than selective teacher intervention in the student learning process or low-rank adaptation applied separately. The improved classical knowledge distillation model reached, on long sequences, 97% of the quality of full fine-tuning and 98% of the quality of the low-rank adaptation method in terms of ROUGE-L and perplexity, while the number of trainable parameters was reduced by 99% compared to full fine-tuning and by 49% compared to low-rank adaptation. In addition, GPU memory usage is reduced by 75% and 30%, respectively, and inference time by 30%. The proposed combination of knowledge distillation methods can find application in problems with limited computational resources.
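The parameter saving from replacing a fully connected layer with a low-rank one can be checked with simple arithmetic. The sketch below is illustrative only: the layer sizes and the rank are assumed values, not figures from the paper, the function names are hypothetical, and bias terms are ignored.

```python
def dense_params(d_in: int, d_out: int) -> int:
    """Weights in a full d_in x d_out fully connected layer (bias ignored)."""
    return d_in * d_out


def low_rank_params(d_in: int, d_out: int, rank: int) -> int:
    """Weights when the layer is factored as (d_in x rank) @ (rank x d_out)."""
    return rank * (d_in + d_out)


def reduction(d_in: int, d_out: int, rank: int) -> float:
    """Fraction of weights removed by the low-rank factorization."""
    return 1 - low_rank_params(d_in, d_out, rank) / dense_params(d_in, d_out)


# Assumed example sizes: a 4096 x 4096 layer factored with rank 8.
full = dense_params(4096, 4096)        # 16,777,216 weights
small = low_rank_params(4096, 4096, 8)  # 65,536 weights
saved = reduction(4096, 4096, 8)        # about 99.6% fewer weights
```

The saving grows as the rank shrinks relative to the layer width, which is why a small rank can remove most of the trainable weights of a wide layer.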
- Book Chapter
10
- 10.1016/b978-0-32-385787-1.00013-0
- Jan 1, 2022
- Deep Learning for Robot Perception and Cognition
Chapter 8 - Knowledge distillation
- Research Article
17
- 10.3390/electronics13224530
- Nov 18, 2024
- Electronics
The rapid evolution of deep learning has led to significant achievements in computer vision, primarily driven by complex convolutional neural networks (CNNs). However, the increasing depth and parameter count of these networks often result in overfitting and elevated computational demands. Knowledge distillation (KD) has emerged as a promising technique to address these issues by transferring knowledge from a large, well-trained teacher model to a more compact student model. This paper introduces a novel knowledge distillation method that simplifies the distillation process and narrows the performance gap between teacher and student models without relying on intricate knowledge representations. Our approach leverages a unique teacher network architecture designed to enhance the efficiency and effectiveness of knowledge transfer. Additionally, we introduce a streamlined teacher network architecture that transfers knowledge effectively through a simplified distillation process, enabling the student model to achieve high accuracy with reduced computational demands. Comprehensive experiments conducted on the CIFAR-10 dataset demonstrate that our proposed model achieves superior performance compared to traditional KD methods and established architectures such as ResNet and VGG networks. The proposed method not only maintains high accuracy but also significantly reduces training and validation losses. Key findings highlight the optimal hyperparameter settings (temperature T = 15.0 and smoothing factor α = 0.7), which yield the highest validation accuracy and lowest loss values. This research contributes to the theoretical and practical advancements in knowledge distillation, providing a robust framework for future applications and research in neural network compression and optimization. The simplicity and efficiency of our approach pave the way for more accessible and scalable solutions in deep learning model deployment.
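For readers unfamiliar with the two hyperparameters mentioned above, the following is a minimal sketch of the classical temperature-scaled distillation loss, not the method proposed in this paper. The abstract does not say which term α weights; here α is assumed to weight the soft (teacher) term, and the function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, label, temperature=15.0, alpha=0.7):
    """Classical KD loss: alpha * T^2 * KL(teacher || student) + (1 - alpha) * CE.

    Assumption: alpha weights the soft term; some papers use the opposite
    convention.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # Soft term: KL divergence between the softened distributions.
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student))
    # Hard term: cross-entropy with the true label at temperature 1.
    ce = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce


# When student and teacher logits agree, only the hard-label term remains.
loss_same = kd_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], label=2)
```

A high temperature such as 15 flattens both distributions, so the student is trained mostly on the relative ordering of the teacher's non-target classes; the T² factor keeps the soft-term gradients on a scale comparable to the hard-label term.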
- Research Article
2
- 10.3390/electronics11193018
- Sep 22, 2022
- Electronics
Deep learning is used for automatic modulation recognition, and because of the need for high classification accuracy, deeper and deeper networks are used. However, these networks are computationally very expensive to train and run, so their utility on mobile devices with limited memory or weak computational power is questionable. As a result, a trade-off between network depth and classification accuracy must be considered. To address this issue, we used a knowledge distillation method in this study to improve the classification accuracy of a small network model. First, we trained Inception–Resnet as a teacher network, which has a size of 311.77 MB and a final peak classification accuracy of 93.09%. We used the method to train convolutional neural network 3 (CNN3) and increase its peak classification accuracy from 79.81 to 89.36%, with a network size of 0.37 MB. It was also used similarly to train mini Inception–Resnet and increase its peak accuracy from 84.18 to 93.59%, with a network size of 39.69 MB. When we compared all classification accuracy peaks, we found that knowledge distillation improved small networks and that the student network had the potential to outperform the teacher network. Using knowledge distillation, a small network model can achieve the classification accuracy of a large network model. In practice, choosing the appropriate student network based on the constraints of the usage conditions while using knowledge distillation (KD) would be a way to meet practical needs.
- Research Article
5
- 10.1063/5.0255692
- Mar 1, 2025
- Physics of Fluids
Deep learning has shown great potential in improving the efficiency of airfoil flow field prediction by reducing the computational cost compared to traditional numerical methods. However, the large number of parameters in deep learning models can lead to excessive resource consumption, hurting their performance in real-time applications. To address these challenges, we propose a novel compression mechanism called Physics-Informed Neural Network Compression Mechanism (PINNCoM) to reduce model size and improve efficiency. PINNCoM consists of two stages: knowledge distillation and self-adaptive pruning. The knowledge distillation extracts key parameters from a given teacher model, i.e., a neural network model for airfoil flow field prediction, to construct a student model. By designing a physical information loss term based on the Navier–Stokes equations during the knowledge distillation, the student model can maintain fewer parameters and accurately predict the flow field in the meantime. The second stage is self-adaptive pruning, which further compresses the student model by removing redundant channels in the network while preserving its accuracy. Specifically, a reward function is designed to incorporate both physical and channel information to ensure the prediction results align with physical laws while prioritizing critical channels for retention, enabling a flexible and efficient pruning mechanism. Experimental results on airfoil flow field prediction datasets demonstrate that PINNCoM effectively reduces computational complexity with minimal accuracy loss. The proposed PINNCoM mechanism innovatively integrates physical knowledge distillation with adaptive pruning to ensure both model efficiency and physical consistency, providing a new paradigm for physically constrained neural network compression in fluid dynamics applications.
- Research Article
2
- 10.1177/15501477211057037
- Nov 1, 2021
- International Journal of Distributed Sensor Networks
Deep neural networks have achieved great success in a variety of applications, such as self-driving cars and intelligent robotics. Meanwhile, knowledge distillation has received increasing attention as an effective model compression technique for training very efficient deep models. The performance of the student network obtained through knowledge distillation heavily depends on whether the transfer of the teacher’s knowledge can effectively guide the student training. However, most existing knowledge distillation schemes require a large teacher network pre-trained on large-scale data sets, which can increase the difficulty of knowledge distillation in different applications. In this article, we propose a feature fusion-based collaborative learning method for knowledge distillation. Specifically, during knowledge distillation, it enables networks to learn from each other using the feature/response-based knowledge in different network layers. We concatenate the features learned by the teacher and the student networks to obtain a more representative feature map for knowledge transfer. In addition, we introduce a network regularization method to further improve the model performance by providing positive knowledge during training. Experiments and ablation studies on two widely used data sets demonstrate that the proposed method, feature fusion-based collaborative learning, significantly outperforms recent state-of-the-art knowledge distillation methods.
- Conference Article
- 10.1117/12.2678883
- May 25, 2023
An important component of intelligent driving technology is the recognition of traffic signs based on convolutional neural networks (CNNs). Designing a traffic sign recognition system with high accuracy and good real-time performance is crucial for the safe driving of vehicles. Current traffic sign detection algorithms suffer from high network complexity, a large amount of computation, and difficult edge deployment. This paper proposes a deep neural network compression strategy that combines lightweight model design, pruning, knowledge distillation, and quantization. A lightweight fully connected layer is used to accelerate inference, and knowledge distillation is used to help the pruned student network recover the lost accuracy. The teacher network helps the pruned network restore the original accuracy, improves its generalization ability, and keeps the small network from failing after excessive pruning, so that a higher pruning rate can be achieved. The experiment shows that pruning assisted by knowledge distillation recovers accuracy better than ordinary pruning. On the GTSRB traffic sign dataset, the mainstream network models VGGNet and AlexNet are used for training and testing, and the models are compared before and after compression. Based on the results, the model can be compressed to 0.08% with an accuracy of 97.32%.
- Conference Article
10
- 10.1109/icassp49357.2023.10095109
- Jun 4, 2023
Knowledge distillation (KD) is a machine learning technique widely used in recent years for the task of domain adaptation and complexity reduction. It relies on a Student-Teacher mechanism to transfer the knowledge of a large and complex Teacher network into a smaller Student model. Given the inherent complexity of large Deep Neural Network (DNN) models, and the need for deployment on edge devices with limited resources, complexity reduction techniques have become a hot topic in the Non-intrusive Load Monitoring (NILM) community. Recent literature in NILM has devoted increased effort to domain adaptation and architecture reduction via KD. However, the mechanism behind the transfer of knowledge from the Teacher to the Student is not clearly understood. In this work, we aim to address the aforementioned issue by placing the KD NILM approach in a framework of explainable AI (XAI). We identify the main inconsistency in the transfer of explainable knowledge, and exploit this information to propose a method for improvement of KD through explainability guided learning. We evaluate our approach on a variety of appliances and domain adaptation scenarios and demonstrate that solving inconsistencies in the transfer of explainable knowledge can lead to improvement in predictive performance.
- Research Article
- 10.1109/jsen.2024.3517653
- Feb 1, 2025
- IEEE sensors journal
The analysis of wearable sensor data has enabled many successes in several applications. To represent the high-sampling rate time-series with sufficient detail, the use of topological data analysis (TDA) has been considered, and it is found that TDA can complement other time-series features. Nonetheless, because extracting topological features through TDA is time-consuming and computationally demanding, it is difficult to deploy topological knowledge in machine learning and various applications. To tackle this problem, knowledge distillation (KD) can be adopted, which is a technique facilitating model compression and transfer learning to generate a smaller model by transferring knowledge from a larger network. By leveraging multiple teachers in KD, both time-series and topological features can be transferred, and finally, a superior student using only time-series data is distilled. On the other hand, mixup has been popularly used as a robust data augmentation technique to enhance model performance during training. Mixup and KD employ similar learning strategies. In KD, the student model learns from the smoothed distribution generated by the teacher model, while mixup creates smoothed labels by blending two labels. Hence, this common smoothness is the link between the two methods. Even though the interplay between mixup and KD has been widely studied, most studies focus on image-based analysis only, and it remains to be understood how mixup behaves in the context of KD when incorporating multimodal data, such as time-series and topological knowledge from wearable sensor data. In this paper, we analyze the role of mixup in KD with time-series as well as topological persistence, employing multiple teachers. We present a comprehensive analysis of various methods in KD and mixup, supported by empirical results on wearable sensor data. 
We observe that applying mixup to training a student in KD improves performance. We suggest a general set of recommendations to obtain an enhanced student.
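The label smoothing that mixup produces by blending two labels can be shown in a few lines. This is a minimal sketch of standard mixup, not the paper's training pipeline; the function name and the toy inputs are assumptions made for illustration.

```python
def mixup(x_i, y_i, x_j, y_j, lam):
    """Blend two samples and their one-hot labels with mixing weight lam.

    In standard mixup, lam is drawn from a Beta(a, a) distribution
    (e.g. random.betavariate(a, a)); it is passed in here so the example
    stays deterministic.
    """
    x = [lam * a + (1 - lam) * b for a, b in zip(x_i, x_j)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y_i, y_j)]
    return x, y


# Toy two-class example: the blended label is soft rather than one-hot.
x_mix, y_mix = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], lam=0.75)
```

The blended label still sums to one but spreads probability mass over both classes, which is the same kind of softened target a student receives from a teacher's output distribution in KD.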
- Research Article
16
- 10.1016/j.ins.2019.10.074
- Nov 1, 2019
- Information Sciences
Block change learning for knowledge distillation
- Conference Article
4
- 10.1145/3589334.3645440
- May 13, 2024
Unsupervised semantic hashing has emerged as an indispensable technique for fast image search, which aims to convert images into binary hash codes without relying on labels. Recent advancements in the field demonstrate that employing large-scale backbones (e.g., ViT) in unsupervised semantic hashing models can yield substantial improvements. However, the inference delay has become increasingly difficult to overlook. Knowledge distillation provides a means for practical model compression to alleviate this delay. Nevertheless, the prevailing knowledge distillation approaches are not explicitly designed for semantic hashing. They ignore the unique search paradigm of semantic hashing, the inherent necessities of the distillation process, and the property of hash codes. In this paper, we propose an innovative Bit-mask Robust Contrastive knowledge Distillation (BRCD) method, specifically devised for the distillation of semantic hashing models. To ensure the effectiveness of two kinds of search paradigms in the context of semantic hashing, BRCD first aligns the semantic spaces between the teacher and student models through a contrastive knowledge distillation objective. Additionally, to eliminate noisy augmentations and ensure robust optimization, a cluster-based method within the knowledge distillation process is introduced. Furthermore, through a bit-level analysis, we uncover the presence of redundancy bits resulting from the bit independence property. To mitigate these effects, we introduce a bit mask mechanism in our knowledge distillation objective. Finally, extensive experiments not only showcase the noteworthy performance of our BRCD method in comparison to other knowledge distillation methods but also substantiate the generality of our methods across diverse semantic hashing models and backbones. The code for BRCD is available at https://github.com/hly1998/BRCD.
- Research Article
2
- 10.3390/s24051612
- Mar 1, 2024
- Sensors
An equalizer based on a recurrent neural network (RNN), especially with a bidirectional gated recurrent unit (biGRU) structure, is a good choice to deal with nonlinear damage and inter-symbol interference (ISI) in optical communication systems because of its excellent performance in processing time series information. However, its recursive structure prevents the parallelization of the computation, resulting in a low equalization rate. In order to improve the speed without compromising the equalization performance, we propose a minimalist 1D convolutional neural network (CNN) equalizer, which is derived from a biGRU with knowledge distillation (KD). In this work, we applied KD to regression problems and explained how KD helps students learn from teachers in solving regression problems. In addition, we compared the biGRU, 1D-CNN after KD and 1D-CNN without KD in terms of Q-factor and equalization speed. The experimental data showed that the Q-factor of the 1D-CNN increased by 1 dB after KD learning from the biGRU, and KD increased the RoP sensitivity of the 1D-CNN by 0.89 dB at the HD-FEC threshold of 1 × 10⁻³. At the same time, compared with the biGRU, the proposed 1D-CNN equalizer reduced the computational time consumption by 97% and the number of trainable parameters by 99.3%, with only a 0.5 dB Q-factor penalty. The results demonstrate that the proposed minimalist 1D-CNN equalizer holds significant promise for future practical deployments in optical wireless communication systems.
- Research Article
12
- 10.1016/j.neucom.2024.127516
- Mar 5, 2024
- Neurocomputing
Multi-perspective analysis on data augmentation in knowledge distillation
- Research Article
- 10.1109/tpami.2025.3647862
- Jan 1, 2025
- IEEE transactions on pattern analysis and machine intelligence
Overfitting in deep neural networks occurs less frequently than expected. This is a puzzling observation, as theory predicts that greater model capacity should eventually lead to overfitting - yet this is rarely seen in practice. But what if overfitting does occur, not globally, but in specific sub-regions of the data space? In this work, we introduce a novel score that measures the forgetting rate of deep models on validation data, capturing what we term local overfitting: a performance degradation confined to certain regions of the input space. We demonstrate that local overfitting can arise even without conventional overfitting, and is closely linked to the double descent phenomenon. Building on these insights, we introduce a two-stage approach that leverages the training history of a single model to recover and retain forgotten knowledge: first, by aggregating checkpoints into an ensemble, and then by distilling it into a single model of the original size, thus enhancing performance without added inference cost. Extensive experiments across multiple datasets, modern architectures, and training regimes validate the effectiveness of our approach. Notably, in the presence of label noise, our method - Knowledge Fusion followed by Knowledge Distillation - outperforms both the original model and independently trained ensembles, achieving a rare win-win scenario: reduced training and inference complexity.
- Research Article
3294
- 10.1007/s11263-021-01453-z
- Mar 22, 2021
- International Journal of Computer Vision
In recent years, deep neural networks have been successful in both industry and academia, especially for computer vision tasks. The great success of deep learning is mainly due to its scalability to encode large-scale data and to maneuver billions of model parameters. However, it is a challenge to deploy these cumbersome deep models on devices with limited resources, e.g., mobile phones and embedded devices, not only because of the high computational complexity but also the large storage requirements. To this end, a variety of model compression and acceleration techniques have been developed. As a representative type of model compression and acceleration, knowledge distillation effectively learns a small student model from a large teacher model. It has received rapidly increasing attention from the community. This paper provides a comprehensive survey of knowledge distillation from the perspectives of knowledge categories, training schemes, teacher-student architecture, distillation algorithms, performance comparison and applications. Furthermore, challenges in knowledge distillation are briefly reviewed, and comments on future research are discussed.