Feature fusion-based collaborative learning for knowledge distillation
Deep neural networks have achieved great success in a variety of applications, such as self-driving cars and intelligent robotics. Meanwhile, knowledge distillation has received increasing attention as an effective model compression technique for training very efficient deep models. The performance of the student network obtained through knowledge distillation heavily depends on whether the transfer of the teacher’s knowledge can effectively guide the student training. However, most existing knowledge distillation schemes require a large teacher network pre-trained on large-scale data sets, which can increase the difficulty of knowledge distillation in different applications. In this article, we propose feature fusion-based collaborative learning for knowledge distillation. Specifically, during knowledge distillation, it enables networks to learn from each other using the feature/response-based knowledge in different network layers. We concatenate the features learned by the teacher and the student networks to obtain a more representative feature map for knowledge transfer. In addition, we introduce a network regularization method to further improve the model performance by providing positive knowledge during training. Experiments and ablation studies on two widely used data sets demonstrate that the proposed method, feature fusion-based collaborative learning, significantly outperforms recent state-of-the-art knowledge distillation methods.
- Conference Article
3
- 10.1109/yac57282.2022.10023833
- Nov 19, 2022
Real-time semantic segmentation is a key research topic in applied artificial intelligence, with uses such as autonomous driving and intelligent robotics. At present, real-time semantic segmentation models still consume large amounts of storage space and computing resources. As an efficient model compression method, knowledge distillation is widely used in various fields of computer vision. In this paper, we propose a novel knowledge distillation framework based on a generative adversarial network structure, which combines spatial consistency and temporal consistency. The teacher network in this framework jointly uses a CNN branch and a transformer branch to improve the spatial consistency of the lightweight real-time semantic segmentation student network. In addition, we integrate the inter-frame relationships obtained by the optical flow network and the semantic segmentation network over continuous time as the temporal consistency constraint of the student network. Finally, spatial consistency and temporal consistency are coupled as spatio-temporal consistency knowledge. The main purpose of our knowledge distillation method is to transfer the spatio-temporal consistency knowledge held by the teacher to the student. The student network obtained by knowledge distillation can process each frame independently in the inference stage. Because our knowledge distillation method does not participate in the student network's inference, it does not increase the student's computational cost at inference time, yet it narrows the performance gap in real-time semantic segmentation between large and compact models. Using our method, we obtain a high-performance, efficient lightweight model. We verify the effectiveness of the proposed method on the CamVid and Cityscapes datasets.
- Research Article
6
- 10.1016/j.dsp.2024.104512
- Apr 17, 2024
- Digital Signal Processing
Discretization and decoupled knowledge distillation for arbitrary oriented object detection
- Conference Article
4
- 10.1145/3589334.3645440
- May 13, 2024
Unsupervised semantic hashing has emerged as an indispensable technique for fast image search, which aims to convert images into binary hash codes without relying on labels. Recent advancements in the field demonstrate that employing large-scale backbones (e.g., ViT) in unsupervised semantic hashing models can yield substantial improvements. However, the inference delay has become increasingly difficult to overlook. Knowledge distillation provides a means for practical model compression to alleviate this delay. Nevertheless, the prevailing knowledge distillation approaches are not explicitly designed for semantic hashing. They ignore the unique search paradigm of semantic hashing, the inherent necessities of the distillation process, and the property of hash codes. In this paper, we propose an innovative Bit-mask Robust Contrastive knowledge Distillation (BRCD) method, specifically devised for the distillation of semantic hashing models. To ensure the effectiveness of two kinds of search paradigms in the context of semantic hashing, BRCD first aligns the semantic spaces between the teacher and student models through a contrastive knowledge distillation objective. Additionally, to eliminate noisy augmentations and ensure robust optimization, a cluster-based method within the knowledge distillation process is introduced. Furthermore, through a bit-level analysis, we uncover the presence of redundancy bits resulting from the bit independence property. To mitigate these effects, we introduce a bit mask mechanism in our knowledge distillation objective. Finally, extensive experiments not only showcase the noteworthy performance of our BRCD method in comparison to other knowledge distillation methods but also substantiate the generality of our methods across diverse semantic hashing models and backbones. The code for BRCD is available at https://github.com/hly1998/BRCD.
- Research Article
8
- 10.1016/j.csl.2023.101583
- Nov 9, 2023
- Computer Speech & Language
Dual Knowledge Distillation for neural machine translation
- Research Article
5
- 10.1016/j.asoc.2024.111579
- Apr 9, 2024
- Applied Soft Computing
PURF: Improving teacher representations by imposing smoothness constraints for knowledge distillation
- Conference Article
15
- 10.1109/ijcnn48605.2020.9207148
- Jul 1, 2020
In recent years, deep learning has spread rapidly, and deeper, larger models have been proposed. However, the computational cost becomes enormous as models grow. Various techniques for compressing models have been proposed to improve performance while reducing computational costs. One such method is knowledge distillation (KD). Knowledge distillation is a technique for transferring the knowledge of deep or ensemble models with many parameters (teacher model) to smaller, shallower models (student model). Since the purpose of knowledge distillation is to increase the similarity between the teacher model and the student model, we propose to introduce the concept of metric learning into knowledge distillation to bring the student model closer to the teacher model using pairs or triplets of training samples. In metric learning, researchers develop methods to build a model that increases the similarity of outputs for similar samples. Metric learning aims to reduce the distance between similar samples and increase the distance between dissimilar samples. The ability of metric learning to reduce the differences between similar outputs can be used in knowledge distillation to reduce the differences between the outputs of the teacher model and the student model. Since the outputs of the teacher model for different objects are usually different, the student model needs to distinguish them. We think that metric learning can clarify the difference between the different outputs, and that the performance of the student model could thereby be improved. We have performed experiments to compare the proposed method with state-of-the-art knowledge distillation methods. The results show that the student model obtained by the proposed method achieves higher performance than conventional knowledge distillation methods.
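The triplet idea described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the student output is the anchor, the teacher output for the same sample is the positive, and the teacher output for a different sample is the negative. The function names, the Euclidean distance, and the margin value are all hypothetical choices.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length output vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_distillation_loss(student_out, teacher_same, teacher_other, margin=1.0):
    """Hinge-style triplet loss (assumed form).

    anchor   = student output for sample x
    positive = teacher output for the same sample x
    negative = teacher output for a different sample
    The loss is zero once the student is closer to the positive than to the
    negative by at least `margin`.
    """
    d_pos = euclidean(student_out, teacher_same)
    d_neg = euclidean(student_out, teacher_other)
    return max(0.0, d_pos - d_neg + margin)


teacher_same = [1.0, 0.0]
teacher_other = [0.0, 1.0]

# A student that already sits near the matching teacher output.
loss_close = triplet_distillation_loss([0.9, 0.1], teacher_same, teacher_other)

# A student equidistant from both teacher outputs: the loss equals the margin.
loss_ambiguous = triplet_distillation_loss([0.5, 0.5], teacher_same, teacher_other)
```

In a training loop this term would typically be added to a task loss; how the abstract's method weights or samples the triplets is not stated there and is left open here.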
- Conference Article
1
- 10.1145/3442555.3442581
- Nov 27, 2020
Knowledge distillation is dedicated to improving the performance of lightweight networks by transferring knowledge during the training process. Meanwhile, it is important to be able to apply knowledge distillation in different situations. The previous knowledge distillation method with adversarial samples uses a traditional knowledge distillation loss to let the student learn a good decision boundary. In this paper, we propose a novel method named Adversarial Metric Knowledge Distillation (AMKD), which utilizes adversarial samples to transfer the dark knowledge from the teacher to the student. We select adversarial samples that are close to the decision boundary between two classes and measure their distance to negative-class samples under a triplet loss constraint. The method ensures that the student network learns relationships among samples through quantitative metric learning. Therefore, we not only transfer information about the decision boundary but also ensure that the student network always maintains a proper distance from other negative classes. This can be another good exploration for knowledge distillation with adversarial samples. Experiments on the CIFAR-10, CIFAR-100 and Tiny ImageNet datasets verify that the proposed knowledge distillation method effectively improves the student network's performance.
- Research Article
47
- 10.1007/s11263-023-01792-z
- Apr 25, 2023
- International Journal of Computer Vision
Knowledge distillation is a simple yet effective technique for deep model compression, which aims to transfer the knowledge learned by a large teacher model to a small student model. To mimic how the teacher teaches the student, existing knowledge distillation methods mainly adopt a unidirectional knowledge transfer, where the knowledge extracted from different intermediate layers of the teacher model is used to guide the student model. However, it turns out that students can learn more effectively through multi-stage learning with self-reflection in the real-world education scenario, which is nevertheless ignored by current knowledge distillation methods. Inspired by this, we devise a new knowledge distillation framework entitled multi-target knowledge distillation via student self-reflection or MTKD-SSR, which can not only enhance the teacher’s ability in unfolding the knowledge to be distilled, but also improve the student’s capacity of digesting the knowledge. Specifically, the proposed framework consists of three target knowledge distillation mechanisms: a stage-wise channel distillation (SCD), a stage-wise response distillation (SRD), and a cross-stage review distillation (CRD), where SCD and SRD transfer feature-based knowledge (i.e., channel features) and response-based knowledge (i.e., logits) at different stages, respectively; and CRD encourages the student model to conduct self-reflective learning after each stage by a self-distillation of the response-based knowledge. Experimental results on five popular visual recognition datasets, CIFAR-100, Market-1501, CUB200-2011, ImageNet, and Pascal VOC, demonstrate that the proposed framework significantly outperforms recent state-of-the-art knowledge distillation methods.
- Research Article
18
- 10.1016/j.knosys.2022.109832
- Sep 3, 2022
- Knowledge-Based Systems
Multi-instance semantic similarity transferring for knowledge distillation
- Research Article
12
- 10.1016/j.neucom.2024.127516
- Mar 5, 2024
- Neurocomputing
Multi-perspective analysis on data augmentation in knowledge distillation
- Research Article
1
- 10.1007/s11548-025-03346-9
- Apr 22, 2025
- International journal of computer assisted radiology and surgery
This paper aims to apply decoupled knowledge distillation (DKD) to medical image segmentation, focusing on transferring knowledge from a high-performance teacher network to a lightweight student network, thereby facilitating model deployment on embedded devices. We initially decouple the distillation loss into pixel-wise target class knowledge distillation (PTCKD) and pixel-wise non-target class knowledge distillation (PNCKD). Subsequently, to address the limitations of the fixed weight paradigm in PTCKD, we propose a novel feature distance-weighted adaptive decoupled knowledge distillation (FDWA-DKD) method. FDWA-DKD quantifies the feature disparity between student and teacher, generating instance-level adaptive weights for PTCKD. We design a feature distance weighting (FDW) module that dynamically calculates feature distance to obtain adaptive weights, integrating feature space distance information into logit distillation. Lastly, we introduce a class-wise feature probability distribution loss to encourage the student to mimic the teacher's spatial distribution. Extensive experiments conducted on the Synapse and FLARE22 datasets demonstrate that our proposed FDWA-DKD achieves satisfactory performance, yielding optimal Dice scores and, in some instances, surpassing the performance of the teacher network. Ablation studies further validate the effectiveness of each module within our proposed method. Our method overcomes the constraints of traditional distillation methods by offering instance-level adaptive learning weights tailored to PTCKD. By quantifying student-teacher feature disparity and minimizing class-wise feature probability distribution loss, our method outperforms other distillation methods.
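The decoupling step mentioned in this abstract can be illustrated for a single pixel. The sketch below follows the standard decoupled knowledge distillation decomposition for one softmax output and assumes the pixel-wise variant applies it independently at each pixel. The feature distance weighting (FDW) module and the adaptive weights are not reproduced, and all function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL divergence KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def decoupled_kd(teacher_logits, student_logits, target):
    """Split the KD loss at one pixel into target and non-target parts.

    tckd: KL between the binary (target, non-target) distributions.
    nckd: KL between the distributions over non-target classes only,
          each renormalized to sum to 1.
    """
    pt = softmax(teacher_logits)
    ps = softmax(student_logits)
    tckd = kl([pt[target], 1.0 - pt[target]], [ps[target], 1.0 - ps[target]])
    nt_teacher = [p / (1.0 - pt[target]) for i, p in enumerate(pt) if i != target]
    nt_student = [p / (1.0 - ps[target]) for i, p in enumerate(ps) if i != target]
    nckd = kl(nt_teacher, nt_student)
    return tckd, nckd, pt, ps


teacher = [2.0, 1.0, 0.1]   # illustrative logits for one pixel, three classes
student = [1.5, 0.5, 0.3]
tckd, nckd, pt, ps = decoupled_kd(teacher, student, target=0)

# The classic KD loss equals tckd + (1 - p_target_teacher) * nckd.
classic_kd = kl(pt, ps)
recombined = tckd + (1.0 - pt[0]) * nckd
```

Decoupling lets the two terms be weighted separately instead of tying the non-target term to the teacher's confidence. According to the abstract, FDWA-DKD replaces the fixed weight on the target-class term with an instance-level weight derived from student-teacher feature distance.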
- Research Article
52
- 10.1109/tnnls.2022.3212733
- May 1, 2024
- IEEE Transactions on Neural Networks and Learning Systems
Knowledge distillation (KD), as an efficient and effective model compression technique, has received considerable attention in deep learning. The key to its success is transferring knowledge from a large teacher network to a small student network. However, most existing KD methods consider only one type of knowledge learned from either instance features or relations via a specific distillation strategy, failing to explore the idea of transferring different types of knowledge with different distillation strategies. Moreover, the widely used offline distillation also suffers from a limited learning capacity due to the fixed large-to-small teacher-student architecture. In this article, we devise a collaborative KD via multiknowledge transfer (CKD-MKT) that prompts both self-learning and collaborative learning in a unified framework. Specifically, CKD-MKT utilizes a multiple knowledge transfer framework that assembles self and online distillation strategies to effectively: 1) fuse different kinds of knowledge, which allows multiple students to learn knowledge from both individual instances and instance relations, and 2) guide each other by learning from themselves using collaborative and self-learning. Experiments and ablation studies on six image datasets demonstrate that the proposed CKD-MKT significantly outperforms recent state-of-the-art methods for KD.
- Book Chapter
9
- 10.1007/978-3-031-26284-5_31
- Jan 1, 2023
Knowledge distillation is an effective way to transfer knowledge from a large model to a small model, which can significantly improve the performance of the small model. In recent years, some contrastive learning-based knowledge distillation methods (e.g., SSKD and HSAKD) have achieved excellent performance by utilizing data augmentation. However, the value of data augmentation has been overlooked by researchers in knowledge distillation, and no work analyzes its role in detail. To fill this gap, we analyze the effect of data augmentation on knowledge distillation from multiple perspectives. In particular, we demonstrate the following properties of data augmentation: (a) data augmentation can effectively help knowledge distillation work even if the teacher model does not have the information about augmented samples, and our proposed diverse and rich Joint Data Augmentation (JDA) is more effective than rotation alone in knowledge distillation; (b) using diverse and rich augmented samples to assist the teacher model in training can improve its performance, but not the performance of the student model; (c) the student model can achieve excellent performance when the proportion of augmented samples is within a suitable range; (d) data augmentation enables knowledge distillation to work better in a few-shot scenario; (e) data augmentation is seamlessly compatible with some knowledge distillation methods and can potentially further improve their performance. Building on the above analysis, we propose a method named Cosine Confidence Distillation (CCD) to transfer the augmented samples’ knowledge more reasonably. CCD achieves better performance than the latest SOTA HSAKD with fewer storage requirements on CIFAR-100 and ImageNet-1k. Our code is released at https://github.com/liwei-group/CCD.
- Research Article
49
- 10.1109/access.2020.2983174
- Jan 1, 2020
- IEEE Access
Convolutional neural networks (CNNs) have significantly improved the accuracy of object detection. As networks become deeper, detection precision improves markedly, but more floating-point calculations are also needed. This large amount of computation makes such networks inconvenient for mobile and embedded vision applications. Many researchers apply knowledge distillation to improve the precision of object detection by transferring knowledge from a deeper and larger teacher network to a small student one. Most knowledge distillation methods require complex cost functions and mainly target two-stage object detection algorithms. Therefore, we propose a clean and effective knowledge distillation method called Generative Adversarial Networks - Knowledge Distillation (GAN-KD) for one-stage object detection. The feature maps generated by the teacher network and the student network are employed as true and fake samples, respectively, and adversarial training is performed on them to improve the performance of the student network in one-stage object detection. The experimental results show that our approach achieves a performance gain of 5% mAP compared with MobilenetV1 on the COCO dataset.
- Research Article
2
- 10.3390/electronics11193018
- Sep 22, 2022
- Electronics
Deep learning with neural networks is used for automatic modulation recognition, and because high classification accuracy is needed, deeper and deeper networks are used. However, these networks are computationally very expensive in training and inference, so their utility on a mobile device with limited memory or weak computational power is questionable. As a result, a trade-off between network depth and classification accuracy must be considered. To address this issue, we used a knowledge distillation method in this study to improve the classification accuracy of a small network model. First, we trained Inception–Resnet as a teacher network, which has a size of 311.77 MB and a final peak classification accuracy of 93.09%. We used the method to train convolutional neural network 3 (CNN3) and increase its peak classification accuracy from 79.81% to 89.36%, with a network size of 0.37 MB. It was similarly used to train mini Inception–Resnet and increase its peak accuracy from 84.18% to 93.59%, with a network size of 39.69 MB. When we compared all classification accuracy peaks, we found that knowledge distillation improved the small networks and that the student network had the potential to outperform the teacher network. Using knowledge distillation, a small network model can achieve the classification accuracy of a large network model. In practice, choosing the appropriate student network based on the constraints of the usage conditions while using knowledge distillation (KD) would be a way to meet practical needs.
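The abstract above does not say which distillation loss was used. As a point of reference only, the widely used temperature-softened formulation can be sketched as below. The temperature, the weighting factor `alpha`, and the function names are assumptions for illustration, not values from the study.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature (higher = softer)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits, student_logits, label, temperature=4.0, alpha=0.5):
    """Weighted sum of hard-label cross-entropy and softened KL term.

    alpha weights the cross-entropy against the true label; (1 - alpha)
    weights the KL divergence between softened teacher and student outputs.
    The KL term is scaled by temperature**2, a common convention that keeps
    its gradient magnitude comparable across temperatures.
    """
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    distill = sum(t * math.log(t / s) for t, s in zip(soft_teacher, soft_student))
    hard = -math.log(softmax(student_logits)[label])
    return alpha * hard + (1.0 - alpha) * temperature ** 2 * distill


teacher_logits = [3.0, 1.0, 0.2]   # illustrative values only
student_logits = [1.0, 1.2, 0.4]

loss_mismatch = kd_loss(teacher_logits, student_logits, label=0, alpha=0.0)
loss_match = kd_loss(teacher_logits, teacher_logits, label=0, alpha=0.0)
```

With `alpha=0.0` only the distillation term remains, so a student that reproduces the teacher's logits incurs zero loss.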