Adaptive Teacher Finetune: Towards high-performance knowledge distillation through adaptive fine-tuning
Knowledge distillation is a widely used method to transfer knowledge from a large model to a small model. Traditional methods use pre-trained large models to supervise the training of small models, an approach called Offline Knowledge Distillation. However, the structural gap between teachers and students limits its performance. Later, Online Knowledge Distillation retrained the teacher-student network from the beginning, and the method of echo teaching greatly improved performance. But very little work has explored the difference between the two. In this paper, we first point out that the essential difference between Offline and Online Knowledge Distillation is whether the weights of the teacher-student network go through a process of mutual adaptation. If the teacher network and the student network are trained jointly to implement Offline Knowledge Distillation, there is no obvious difference in the final performance, regardless of whether joint distillation training is used. This shows that teacher-student network adaptation is important for Knowledge Distillation. Then, we propose Adaptive Teacher Finetune (ATF) to adapt the teacher model to the student network. It uses student model information to fine-tune the teacher during the Offline Knowledge Distillation process. With a normalized logit distribution and alpha-divergence, the performance improvement of ATF clearly exceeds that of existing Offline and Online Knowledge Distillation methods. Extensive experiments conducted on CIFAR and ImageNet support our analysis and conclusions. With the newly introduced ATF, we obtain state-of-the-art performance for ResNet-18 on ImageNet.
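For readers unfamiliar with the offline setting the abstract starts from, the following is a minimal sketch of the classic temperature-softened distillation loss, in which a frozen teacher's logits supervise the student. It is not the ATF objective: the abstract names alpha-divergence and a normalized distribution but gives no formulas, so plain KL divergence is used here instead, and the function names and the default temperature are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits, student_logits, temperature=4.0):
    """KL(teacher || student) on temperature-softened distributions.

    The T^2 factor is the usual rescaling that keeps gradient magnitudes
    comparable across temperatures. The default temperature is an assumption.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl
```

In the offline setting only the student's parameters would be updated against this loss; the abstract's point is that letting the teacher adapt as well changes the outcome.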
- Research Article
12
- 10.1038/s41598-023-43986-y
- Oct 26, 2023
- Scientific Reports
Existing knowledge distillation (KD) methods are mainly based on features, logits, or attention, where features and logits represent the results of reasoning at different stages of a convolutional neural network, and attention maps symbolize the reasoning process. Because of the continuity of the two in time, transferring only one of them to the student network leads to unsatisfactory results. We study knowledge transfer between the teacher and student networks to different degrees, revealing the importance of simultaneously transferring knowledge related to the reasoning process and the reasoning results to the student network, which provides a new perspective for the study of KD. On this basis, we propose a knowledge distillation method based on attention and feature transfer (AFT-KD). First, we use transformation structures to transform intermediate features into attention and feature blocks (AFBs) that contain both inference process information and inference outcome information, and force students to learn the knowledge in the AFBs. To save computation in the learning process, we use block operations to align the teacher-student network. In addition, to balance the attenuation ratio between different losses, we design an adaptive loss function based on the loss optimization rate. Experiments show that AFT-KD achieves state-of-the-art performance on multiple benchmarks.
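The abstract does not give the formula for its adaptive loss, so the following is only a sketch of one way to weight several loss terms by how quickly each is being optimized. It resembles dynamic weight averaging from multi-task learning rather than the AFT-KD loss itself; `adaptive_weights` and its `temperature` parameter are hypothetical names.

```python
import math


def adaptive_weights(prev_losses, curr_losses, temperature=1.0):
    """Weight each loss term by its recent rate of decrease.

    rate = current / previous, so a term that is decreasing slowly
    (rate near or above 1) receives a larger weight than one that is
    decreasing quickly. Weights are normalized to sum to the number
    of loss terms. This rule is an assumption, not the paper's.
    """
    rates = [curr / prev for prev, curr in zip(prev_losses, curr_losses)]
    exps = [math.exp(rate / temperature) for rate in rates]
    total = sum(exps)
    return [len(rates) * e / total for e in exps]


def combined_loss(losses, weights):
    """Weighted sum of the individual loss values."""
    return sum(w * loss for w, loss in zip(weights, losses))
```

Here the first argument holds the loss values from the previous step and the second the current ones, for example an attention loss and a feature loss.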
- Research Article
4
- 10.3390/electronics13081595
- Apr 22, 2024
- Electronics
Object detection based on Knowledge Distillation can enhance the capabilities and performance of 5G and 6G networks in various domains, such as autonomous vehicles, smart surveillance, and augmented reality. The integration of object detection with Knowledge Distillation techniques is expected to play a pivotal role in realizing the full potential of these networks. This study presents Shared Knowledge Distillation (Shared-KD) as a solution to overcome optimization challenges caused by disparities in cross-layer features between teacher–student networks. The significant gaps in intermediate-level features between teachers and students present a considerable obstacle to the efficacy of distillation. To tackle this issue, we draw inspiration from collaborative learning in real-world education, where teachers work together to prepare lessons and students engage in peer learning. Building upon this concept, our contributions in model construction are as follows: (1) A teacher knowledge augmentation module is proposed to combine lower-level teacher features, facilitating knowledge transfer from the teacher to the student. (2) A student mutual learning module is introduced to enable students to learn from each other, mimicking the peer learning concept in collaborative learning. (3) The Teacher Share Module combines lower-level teacher features; this describes the specific functionality of the teacher knowledge augmentation module. (4) The knowledge transfer process is broken down into multiple steps, each of which can be easily optimized because of the minimal gap between the features involved.
Shared-KD uses simple feature losses without additional weights in transformation, resulting in an efficient distillation process that can be easily combined with other methods for further improvement. The effectiveness of our approach is validated through experiments on popular tasks such as object detection and instance segmentation.
- Research Article
60
- 10.1016/j.neucom.2022.10.083
- Nov 11, 2022
- Neurocomputing
Low-light image enhancement with knowledge distillation
- Research Article
5
- 10.3390/e25010125
- Jan 7, 2023
- Entropy (Basel, Switzerland)
As a popular research subject in the field of computer vision, knowledge distillation (KD) is widely used in semantic segmentation (SS). However, based on the learning paradigm of the teacher-student model, the poor quality of teacher network feature knowledge still hinders the development of KD technology. In this paper, we investigate the output features of the teacher-student network and propose a feature condensation-based KD network (FCKDNet), which reduces pseudo-knowledge transfer in the teacher-student network. First, combined with the pixel information entropy calculation rule, we design a feature condensation method to separate the foreground feature knowledge from the background noise of the teacher network outputs. Then, the obtained feature condensation matrix is applied to the original outputs of the teacher and student networks to improve the feature representation capability. In addition, after performing feature condensation on the teacher network, we propose a soft enhancement method of features based on spatial and channel dimensions to improve the dependency of pixels in the feature maps. Finally, we divide the outputs of the teacher network into spatial condensation features and channel condensation features and perform distillation loss calculation with the student network separately to assist the student network to converge faster. Extensive experiments on the public datasets Pascal VOC and Cityscapes demonstrate that our proposed method improves the baseline by 3.16% and 2.98% in terms of mAcc, and 2.03% and 2.30% in terms of mIoU, respectively, and has better segmentation performance and robustness than the mainstream methods.
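As an illustration of what a per-pixel information entropy computation looks like, the sketch below computes the Shannon entropy of each pixel's class distribution and thresholds it into a binary mask. How FCKDNet actually builds its feature condensation matrix is not specified in the abstract; treating low-entropy pixels as confident foreground, the `threshold` parameter, and both function names are assumptions made for this example.

```python
import math


def pixel_entropy(class_probs):
    """Shannon entropy (in nats) of one pixel's class distribution."""
    return -sum(p * math.log(p) for p in class_probs if p > 0)


def condensation_mask(prob_map, threshold):
    """Binary mask over an H x W x C nested list of class probabilities.

    A pixel is kept (1) when its entropy is at or below the threshold,
    i.e. when the prediction is confident. This rule is hypothetical.
    """
    return [
        [1 if pixel_entropy(pixel) <= threshold else 0 for pixel in row]
        for row in prob_map
    ]
```

Such a mask could then be multiplied element-wise with teacher and student output maps so that uncertain regions contribute less to the distillation loss.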
- Conference Article
1
- 10.1145/3442555.3442581
- Nov 27, 2020
Knowledge distillation is dedicated to improving the performance of lightweight networks by transferring knowledge during the training process. It is also important to apply knowledge distillation in different situations. The previous knowledge distillation method with adversarial samples uses a traditional knowledge distillation loss to let the student learn a good decision boundary. In this paper, we propose a novel method named Adversarial Metric Knowledge Distillation (AMKD), which utilizes adversarial samples to transfer dark knowledge from the teacher to the student. We select adversarial samples that are close to the decision boundary between two classes and measure their distance to negative class samples under a triplet loss constraint. The method ensures that the student network learns the relationships among samples through quantitative metric learning. Therefore, we not only transfer information about the decision boundary but also ensure that the student network always maintains a proper distance from other negative classes. This can be another good exploration for knowledge distillation with adversarial samples. Experiments on the CIFAR-10, CIFAR-100, and Tiny ImageNet datasets verify that the proposed knowledge distillation method effectively improves student network performance.
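The triplet loss constraint mentioned above has a standard form, sketched below with Euclidean distance. How AMKD picks the anchor, positive, and negative (for instance, which role the near-boundary adversarial sample plays) is not stated in the abstract, so the arguments here are generic embeddings and the default margin is an assumption.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss.

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)
```

Minimizing this loss pulls the anchor toward the positive and pushes it away from the negative, which matches the abstract's goal of keeping a proper distance from negative classes.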
- Research Article
12
- 10.1016/j.neucom.2024.127516
- Mar 5, 2024
- Neurocomputing
Multi-perspective analysis on data augmentation in knowledge distillation
- Conference Article
- 10.1109/icitee56407.2022.9954068
- Oct 18, 2022
Computer vision research has been used in daily applications, such as art, social media app filters, and face recognition. This emergence is due to the use of deep learning methods in the computer vision domain. Deep learning research has improved the quality of service of many applications. Systems ranging from recommendation to detection now rely on deep learning models. However, many current models require high computational processing power and storage space. Implementing such an extensive network with limited resources on an embedded device or smartphone is therefore challenging. In this study, we focus on developing a model with small computational requirements and high accuracy using the knowledge distillation method. We evaluate our model on public and private datasets of receipt and non-receipt images that we gathered from Badan Pendapatan Daerah, CORD, and a Kaggle dataset. After that, we compare it with a regular convolutional neural network (CNN) and a pre-trained model. We found that the knowledge distillation model uses only 12% and 5% of the total weights of the CNN and the pre-trained model, respectively. As a result, we see knowledge distillation as a promising method that could be implemented for automatic receipt identification in the Jakarta Super App.
- Research Article
6
- 10.1016/j.dsp.2024.104512
- Apr 17, 2024
- Digital Signal Processing
Discretization and decoupled knowledge distillation for arbitrary oriented object detection
- Conference Article
4
- 10.1145/3589334.3645440
- May 13, 2024
Unsupervised semantic hashing has emerged as an indispensable technique for fast image search, which aims to convert images into binary hash codes without relying on labels. Recent advancements in the field demonstrate that employing large-scale backbones (e.g., ViT) in unsupervised semantic hashing models can yield substantial improvements. However, the inference delay has become increasingly difficult to overlook. Knowledge distillation provides a means for practical model compression to alleviate this delay. Nevertheless, the prevailing knowledge distillation approaches are not explicitly designed for semantic hashing. They ignore the unique search paradigm of semantic hashing, the inherent necessities of the distillation process, and the property of hash codes. In this paper, we propose an innovative Bit-mask Robust Contrastive knowledge Distillation (BRCD) method, specifically devised for the distillation of semantic hashing models. To ensure the effectiveness of two kinds of search paradigms in the context of semantic hashing, BRCD first aligns the semantic spaces between the teacher and student models through a contrastive knowledge distillation objective. Additionally, to eliminate noisy augmentations and ensure robust optimization, a cluster-based method within the knowledge distillation process is introduced. Furthermore, through a bit-level analysis, we uncover the presence of redundancy bits resulting from the bit independence property. To mitigate these effects, we introduce a bit mask mechanism in our knowledge distillation objective. Finally, extensive experiments not only showcase the noteworthy performance of our BRCD method in comparison to other knowledge distillation methods but also substantiate the generality of our methods across diverse semantic hashing models and backbones. The code for BRCD is available at https://github.com/hly1998/BRCD.
- Research Article
1
- 10.1145/3721984
- Mar 15, 2025
- Journal of Data and Information Quality
With the continuous development of remote sensing technology, the volume of high-resolution remote sensing image data is increasing, and these images feature large coverage, changeable objects, and complex backgrounds. However, the receptive field of current convolutional neural networks is relatively small, which makes it difficult to capture information from the global context. Therefore, we propose remote sensing image classification with a Detailed Attention scheme and a Teacher-Student network, named DATS, to capture information in the global context. First, the detailed attention scheme is used to integrate the spatial relationships of the feature map into the feature channels. The feature map is thus transformed into an attention map, generating structure-preserving and detail-preserving images. Then, the teacher-student network takes the detail-preserving and structure-preserving images as inputs and uses feature refiners to enhance the fine-grained details of the images. Finally, fine-grained details learned by the teacher network are integrated into the main network by knowledge distillation, which achieves effective integration of both local detail features and global structure features. Experiments on the FGSCR-42, WHU-RS19, and NWPU data sets show that the Top-1 classification accuracy of our method reaches 88.82%, 91.82%, and 87.60%, respectively.
- Research Article
5
- 10.1016/j.asoc.2024.111579
- Apr 9, 2024
- Applied Soft Computing
PURF: Improving teacher representations by imposing smoothness constraints for knowledge distillation
- Conference Article
4
- 10.1109/csaiee54046.2021.9543192
- Aug 20, 2021
Convolutional neural network (CNN) is the main tool for deep learning and computer vision, and it has many applications in face recognition, sign language recognition and speech recognition. As deep learning becomes more and more mature, the application of convolutional neural networks will become more and more widespread. As we know, the deeper a neural network is, the higher its memory and computational power overhead. Many neural networks used in medicine, autonomous driving, and language recognition have large model complexity, which makes it difficult to apply these CNNs to people's daily life. Therefore, the development of simple, lightweight and small neural networks has become the focus of researchers nowadays. In this paper, we summarize the development of convolutional neural networks in recent years, as well as various methods for compressing models and migrating data from large models to small ones. In general, the main convolutional neural network compression approaches are: pruning, knowledge distillation, aggregating neurons of different scales, proposing new structures, etc. We start from the structure of neural networks, introduce the major structural changes experienced from the development of convolutional neural networks, and then analyze various pruning, compression and knowledge distillation methods. For specific methods, we run different models and compare the improvements of the new methods with respect to the old ones. We also debugged models on adversarial generative pruning, teacher-student networks, and other compressed CNNs during this period, and drew some constructive conclusions. Finally, we summarize the trends in CNN development in recent years and the challenges we may face in the future.
- Research Article
46
- 10.1155/2021/4019358
- Oct 20, 2021
- Computational and Mathematical Methods in Medicine
Breast cancer is the most common invasive cancer in women and the second leading cause of cancer death in females, and it can be classified as benign or malignant. Research on and prevention of breast cancer have attracted growing attention from researchers in recent years. At the same time, the development of data mining methods provides an effective way to extract more useful information from complex databases, and prediction, classification, and clustering can be performed on the extracted information. The generic notion of knowledge distillation is that a network of higher capacity acts as a teacher and a network of lower capacity acts as a student. Several pipelines of knowledge distillation are known. However, previous work on knowledge distillation using label smoothing regularization produces experiments and results that break this general notion and prove that knowledge distillation also works when a student model distils a teacher model, i.e., reverse knowledge distillation. It is also proved that a poorly trained teacher model can train a student model to reach equivalent results. Building on the ideas from those works, we propose a novel bilateral knowledge distillation regime that enables multiple interactions between teacher and student models, i.e., teaching and distilling each other, eventually improving each other's performance. We evaluate our results on the BACH histopathology image dataset for breast cancer. The pretrained ResNeXt29 and MobileNetV2 models, which have already been tested on the ImageNet dataset, are used for “transfer learning” on our dataset, and we obtain a final accuracy of more than 96% using this novel approach of bilateral KD.
- Book Chapter
9
- 10.1007/978-3-031-26284-5_31
- Jan 1, 2023
Knowledge distillation is an effective way to transfer knowledge from a large model to a small model, which can significantly improve the performance of the small model. In recent years, some contrastive learning-based knowledge distillation methods (i.e., SSKD and HSAKD) have achieved excellent performance by utilizing data augmentation. However, the worth of data augmentation has always been overlooked by researchers in knowledge distillation, and no work analyzes its role in particular detail. To fix this gap, we analyze the effect of data augmentation on knowledge distillation from a multi-sided perspective. In particular, we demonstrate the following properties of data augmentation: (a) data augmentation can effectively help knowledge distillation work even if the teacher model does not have the information about augmented samples, and our proposed diverse and rich Joint Data Augmentation (JDA) is more valid than single rotating in knowledge distillation; (b) using diverse and rich augmented samples to assist the teacher model in training can improve its performance, but not the performance of the student model; (c) the student model can achieve excellent performance when the proportion of augmented samples is within a suitable range; (d) data augmentation enables knowledge distillation to work better in a few-shot scenario; (e) data augmentation is seamlessly compatible with some knowledge distillation methods and can potentially further improve their performance. Enlightened by the above analysis, we propose a method named Cosine Confidence Distillation (CCD) to transfer the augmented samples’ knowledge more reasonably. And CCD achieves better performance than the latest SOTA HSAKD with fewer storage requirements on CIFAR-100 and ImageNet-1k. Our code is released at https://github.com/liwei-group/CCD.
- Conference Article
15
- 10.1109/ijcnn48605.2020.9207148
- Jul 1, 2020
In recent years, deep learning has spread rapidly, and deeper, larger models have been proposed. However, the computational cost becomes enormous as models grow larger. Various techniques for compressing models have been proposed to improve performance while reducing computational costs. One of these methods is knowledge distillation (KD). Knowledge distillation is a technique for transferring the knowledge of deep or ensemble models with many parameters (teacher model) to smaller, shallower models (student model). Since the purpose of knowledge distillation is to increase the similarity between the teacher model and the student model, we propose to introduce the concept of metric learning into knowledge distillation to bring the student model closer to the teacher model using pairs or triplets of training samples. In metric learning, researchers develop methods to build a model that increases the similarity of outputs for similar samples. Metric learning aims at reducing the distance between similar samples and increasing the distance between dissimilar samples. The ability of metric learning to reduce the differences between similar outputs can be used in knowledge distillation to reduce the differences between the outputs of the teacher model and the student model. Since the outputs of the teacher model for different objects are usually different, the student model needs to distinguish them. We think that metric learning can clarify the differences between different outputs, so that the performance of the student model can be improved. We have performed experiments to compare the proposed method with state-of-the-art knowledge distillation methods. The results show that the student model obtained by the proposed method achieves higher performance than conventional knowledge distillation methods.