Adversarial Metric Knowledge Distillation

Abstract

Knowledge distillation aims to improve the performance of lightweight networks by transferring knowledge from a teacher network during training, and it is important that distillation can be applied in different situations. The previous knowledge distillation method with adversarial samples uses a traditional knowledge distillation loss to let the student learn a good decision boundary. In this paper, we propose a novel method named Adversarial Metric Knowledge Distillation (AMKD), which uses adversarial samples to transfer dark knowledge from the teacher to the student. We select adversarial samples that lie close to the decision boundary between two classes and, under a triplet loss constraint, measure their distance to samples of negative classes. The method ensures that the student network learns the relationships among samples through quantitative metric learning. We therefore not only transfer information about the decision boundary but also ensure that the student network maintains a proper distance from negative classes. This offers another direction for knowledge distillation with adversarial samples. Experiments on the CIFAR-10, CIFAR-100 and Tiny ImageNet datasets verify that the proposed method effectively improves student network performance.
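
The triplet constraint mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how anchors, positives and negatives are assigned, so the roles below (adversarial sample as anchor, a same-class sample as positive, another-class sample as negative), the Euclidean distance, the margin value and all embedding values are assumptions made for illustration only, not the authors' implementation.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: zero once the negative is at least
    `margin` farther from the anchor than the positive is."""
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)

# Hypothetical student embeddings (illustrative values, not from the paper).
adv_anchor = [0.0, 0.0]     # adversarial sample near a decision boundary
same_class = [0.0, 0.5]     # assumed positive: sample of the anchor's class
far_negative = [3.0, 4.0]   # negative-class sample that is already far away
near_negative = [0.0, 1.0]  # negative-class sample that is too close

loss_far = triplet_loss(adv_anchor, same_class, far_negative)
loss_near = triplet_loss(adv_anchor, same_class, near_negative)
print(loss_far, loss_near)
```

In this toy setting the far negative already satisfies the margin and contributes no loss, while the near negative is penalized, which is the sense in which a triplet term keeps boundary samples at a distance from other classes.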

Similar Papers
  • Research Article
  • Citations: 6
  • 10.1016/j.dsp.2024.104512
Discretization and decoupled knowledge distillation for arbitrary oriented object detection
  • Apr 17, 2024
  • Digital Signal Processing
  • Cheng Chen + 2 more

  • Conference Article
  • Citations: 4
  • 10.1145/3589334.3645440
Bit-mask Robust Contrastive Knowledge Distillation for Unsupervised Semantic Hashing
  • May 13, 2024
  • Liyang He + 6 more

Unsupervised semantic hashing has emerged as an indispensable technique for fast image search, which aims to convert images into binary hash codes without relying on labels. Recent advancements in the field demonstrate that employing large-scale backbones (e.g., ViT) in unsupervised semantic hashing models can yield substantial improvements. However, the inference delay has become increasingly difficult to overlook. Knowledge distillation provides a means for practical model compression to alleviate this delay. Nevertheless, the prevailing knowledge distillation approaches are not explicitly designed for semantic hashing. They ignore the unique search paradigm of semantic hashing, the inherent necessities of the distillation process, and the property of hash codes. In this paper, we propose an innovative Bit-mask Robust Contrastive knowledge Distillation (BRCD) method, specifically devised for the distillation of semantic hashing models. To ensure the effectiveness of two kinds of search paradigms in the context of semantic hashing, BRCD first aligns the semantic spaces between the teacher and student models through a contrastive knowledge distillation objective. Additionally, to eliminate noisy augmentations and ensure robust optimization, a cluster-based method within the knowledge distillation process is introduced. Furthermore, through a bit-level analysis, we uncover the presence of redundancy bits resulting from the bit independence property. To mitigate these effects, we introduce a bit mask mechanism in our knowledge distillation objective. Finally, extensive experiments not only showcase the noteworthy performance of our BRCD method in comparison to other knowledge distillation methods but also substantiate the generality of our methods across diverse semantic hashing models and backbones. The code for BRCD is available at https://github.com/hly1998/BRCD.

  • Research Article
  • Citations: 8
  • 10.1016/j.csl.2023.101583
Dual Knowledge Distillation for neural machine translation
  • Nov 9, 2023
  • Computer Speech & Language
  • Yuxian Wan + 4 more

  • Research Article
  • Citations: 5
  • 10.1016/j.asoc.2024.111579
PURF: Improving teacher representations by imposing smoothness constraints for knowledge distillation
  • Apr 9, 2024
  • Applied Soft Computing
  • Md Imtiaz Hossain + 3 more

  • Research Article
  • Citations: 2
  • 10.1177/15501477211057037
Feature fusion-based collaborative learning for knowledge distillation
  • Nov 1, 2021
  • International Journal of Distributed Sensor Networks
  • Yiting Li + 4 more

Deep neural networks have achieved great success in a variety of applications, such as self-driving cars and intelligent robotics. Meanwhile, knowledge distillation has received increasing attention as an effective model compression technique for training very efficient deep models. The performance of the student network obtained through knowledge distillation heavily depends on whether the transfer of the teacher’s knowledge can effectively guide the student training. However, most existing knowledge distillation schemes require a large teacher network pre-trained on large-scale data sets, which can increase the difficulty of knowledge distillation in different applications. In this article, we propose feature fusion-based collaborative learning for knowledge distillation. Specifically, during knowledge distillation, it enables networks to learn from each other using the feature- and response-based knowledge in different network layers. We concatenate the features learned by the teacher and the student networks to obtain a more representative feature map for knowledge transfer. In addition, we introduce a network regularization method that further improves model performance by providing positive knowledge during training. Experiments and ablation studies on two widely used data sets demonstrate that the proposed method, feature fusion-based collaborative learning, significantly outperforms recent state-of-the-art knowledge distillation methods.

  • Research Article
  • 10.1109/tmm.2026.3651026
CLIP-SD: CLIP-Enhanced Self-Distillation for Visual Recognition
  • Jan 1, 2026
  • IEEE Transactions on Multimedia
  • Xixi Wang + 3 more

Current knowledge distillation methods typically require significant computational resources and time to train task-specific teacher candidates from scratch and identify the optimal teacher. Although self-distillation methods eliminate the dependency on the teacher by allowing the student model to learn independently, they face two challenges: the student learns correct and incorrect knowledge indiscriminately, and the student's learning scope is limited due to the lack of external teacher supervision. Spurred by these deficiencies, this work proposes a CLIP-enhanced Self-Distillation (CLIP-SD) method to overcome these problems with almost no increase in training time. CLIP-SD comprises two main components: Prediction-oriented Self-Distillation (PSD) and Two-stage Task-guided CLIP Distillation (TTCD). PSD tackles the first challenge by assigning higher and lower weights to correct and incorrect prediction samples, respectively, during self-distillation. This component forces the student to focus on correct knowledge and minimizes the impact of incorrect knowledge. Regarding the second challenge, the robust CLIP model is directly introduced into self-distillation. However, CLIP lacks task-specific knowledge and its output is overly smooth during the distillation process, which prevents the student from learning effectively. Therefore, TTCD refines CLIP's output through a two-stage process, endowing it with task-specific knowledge to enhance student learning. Experimental results indicate that CLIP-SD significantly improves distillation performance while maintaining training efficiency comparable to self-distillation. Specifically, on the CIFAR-100 dataset, the performance of CLIP-SD reaches 72.48% when trained with ResNet20 as the student model, which is an average improvement of 2.54% and 1.12% over the knowledge distillation and self-distillation methods, respectively. Regarding training time, CLIP-SD takes 3.91 hours, an average decrease of 2.73 hours compared to knowledge distillation and an average increase of 0.45 hours compared to self-distillation. Despite the slight increase in training time compared to self-distillation, the overhead is negligible and worthwhile considering the performance improvement.
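
The prediction-oriented weighting described for PSD can be sketched roughly as follows. The abstract only states that correctly predicted samples receive higher weights than incorrectly predicted ones; the function name, the two weight values and the normalization by the weight sum are hypothetical choices for illustration, not the published formulation.

```python
def psd_weighted_loss(per_sample_losses, predictions, labels,
                      w_correct=1.0, w_incorrect=0.25):
    """Weighted mean of per-sample distillation losses. Samples whose
    predicted class equals the label get the larger weight (assumed values)."""
    weights = [w_correct if p == y else w_incorrect
               for p, y in zip(predictions, labels)]
    total = sum(w * loss for w, loss in zip(weights, per_sample_losses))
    return total / sum(weights)

# Hypothetical batch: the second sample is misclassified and has a large loss.
losses = [0.5, 2.0, 1.0]
preds = [1, 0, 2]
labels = [1, 2, 2]

weighted = psd_weighted_loss(losses, preds, labels)
unweighted = sum(losses) / len(losses)
print(weighted, unweighted)
```

With these toy numbers the misclassified sample contributes less to the weighted loss than to the plain mean, so its possibly incorrect soft targets influence the student less.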

  • Research Article
  • Citations: 47
  • 10.1007/s11263-023-01792-z
Multi-target Knowledge Distillation via Student Self-reflection
  • Apr 25, 2023
  • International Journal of Computer Vision
  • Jianping Gou + 5 more

Knowledge distillation is a simple yet effective technique for deep model compression, which aims to transfer the knowledge learned by a large teacher model to a small student model. To mimic how the teacher teaches the student, existing knowledge distillation methods mainly adopt a unidirectional knowledge transfer, where the knowledge extracted from different intermediate layers of the teacher model is used to guide the student model. However, in real-world education, students learn more effectively through multi-stage learning with self-reflection, which is ignored by current knowledge distillation methods. Inspired by this, we devise a new knowledge distillation framework entitled multi-target knowledge distillation via student self-reflection, or MTKD-SSR, which can not only enhance the teacher’s ability in unfolding the knowledge to be distilled, but also improve the student’s capacity of digesting the knowledge. Specifically, the proposed framework consists of three target knowledge distillation mechanisms: a stage-wise channel distillation (SCD), a stage-wise response distillation (SRD), and a cross-stage review distillation (CRD), where SCD and SRD transfer feature-based knowledge (i.e., channel features) and response-based knowledge (i.e., logits) at different stages, respectively; and CRD encourages the student model to conduct self-reflective learning after each stage by a self-distillation of the response-based knowledge. Experimental results on five popular visual recognition datasets, CIFAR-100, Market-1501, CUB200-2011, ImageNet, and Pascal VOC, demonstrate that the proposed framework significantly outperforms recent state-of-the-art knowledge distillation methods.

  • Research Article
  • 10.3390/electronics13204102
Multiloss Joint Gradient Control Knowledge Distillation for Image Classification
  • Oct 17, 2024
  • Electronics
  • Wei He + 6 more

Knowledge distillation (KD) techniques aim to transfer knowledge from complex teacher neural networks to simpler student networks. In this study, we propose a novel knowledge distillation method called Multiloss Joint Gradient Control Knowledge Distillation (MJKD), which functions by effectively combining feature- and logit-based knowledge distillation methods with gradient control. The proposed knowledge distillation method discretely considers the gradients of the task loss (cross-entropy loss), feature distillation loss, and logit distillation loss. The experimental results suggest that logits may contain more information and should, consequently, be assigned greater weight during the gradient update process in this work. The empirical findings on the CIFAR-100 and Tiny-ImageNet datasets indicate that MJKD generally outperforms traditional knowledge distillation methods, significantly enhancing the generalization ability and classification accuracy of student networks. For instance, MJKD achieves a 63.53% accuracy on Tiny-ImageNet for the ResNet18 MobileNetV2 pair. Furthermore, we present visualizations and analyses to explore its potential working mechanisms.

  • Research Article
  • Citations: 145
  • 10.1609/aaai.v33i01.33013771
Knowledge Distillation with Adversarial Samples Supporting Decision Boundary
  • Jul 17, 2019
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • Byeongho Heo + 3 more

Many recent works on knowledge distillation have provided ways to transfer the knowledge of a trained network to improve the learning process of a new one, but finding a good technique for knowledge distillation is still an open problem. In this paper, we provide a new perspective based on the decision boundary, which is one of the most important components of a classifier. The generalization performance of a classifier is closely related to the adequacy of its decision boundary, so a good classifier has a good decision boundary. Therefore, transferring information closely related to the decision boundary is a promising approach to knowledge distillation. To realize this goal, we utilize an adversarial attack to discover samples supporting a decision boundary. Based on this idea, to transfer more accurate information about the decision boundary, the proposed algorithm trains a student classifier on the adversarial samples supporting the decision boundary. Experiments show that the proposed method indeed improves knowledge distillation and achieves state-of-the-art performance.
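
The idea of using an attack to find samples that support a decision boundary can be illustrated on a toy linear classifier. This is a simplified sketch under stated assumptions, not the attack used in the paper: the classifier, its weights, the fixed step size and the stopping rule are all hypothetical. It moves an input along the gradient of the target-minus-base logit difference and stops at the first iterate that lies on or just past the boundary between the two classes.

```python
def logits(weights, bias, x):
    """Logits of a linear classifier: one weight row and bias per class."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def boundary_sample(weights, bias, x, base, target, step, max_iters=1000):
    """Step x along the gradient of (logit[target] - logit[base]) and return
    the first iterate whose target logit is at least the base logit."""
    # For a linear model this gradient is constant in x.
    direction = [wt - wb for wt, wb in zip(weights[target], weights[base])]
    x = list(x)
    for _ in range(max_iters):
        z = logits(weights, bias, x)
        if z[target] >= z[base]:
            break
        x = [x_i + step * d_i for x_i, d_i in zip(x, direction)]
    return x

# Hypothetical two-class model whose boundary is the line x[0] = 0.
W = [[1.0, 0.0], [-1.0, 0.0]]
b = [0.0, 0.0]
clean = [1.0, 0.5]  # classified as class 0

adv = boundary_sample(W, b, clean, base=0, target=1, step=0.125)
print(adv, logits(W, b, adv))
```

A sample produced this way sits near the border between the base and target classes, which is why the teacher's outputs on it carry information about where that border lies.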

  • Research Article
  • Citations: 12
  • 10.1016/j.neucom.2024.127516
Multi-perspective analysis on data augmentation in knowledge distillation
  • Mar 5, 2024
  • Neurocomputing
  • Wei Li + 3 more

  • Conference Article
  • Citations: 7
  • 10.1109/wacv56688.2023.00466
Adversarial local distribution regularization for knowledge distillation
  • Jan 1, 2023
  • Thanh Nguyen-Duc + 4 more

Knowledge distillation is a process of distilling information from a large model with significant knowledge capacity (teacher) to enhance a smaller model (student). Therefore, exploring the properties of the teacher (e.g., teacher decision boundaries) is the key to improving student performance. One technique for exploring decision boundaries is to leverage adversarial attack methods, which add crafted perturbations within a ball constraint to clean inputs to create attack examples of the teacher, called adversarial examples. These adversarial examples are informative because they are near decision boundaries. In this paper, we formulate a teacher adversarial local distribution, the set of all adversarial examples within the ball constraint given an input. This distribution is used to sufficiently explore the decision boundaries of the teacher by covering the full spectrum of possible teacher model perturbations. The student model is then regularized by matching the loss between teacher and student using these adversarial example inputs. We conducted a number of experiments on the CIFAR-100 and ImageNet datasets to illustrate that this teacher adversarial local distribution regularization (TALD) can be applied to improve the performance of many existing knowledge distillation methods (e.g., KD, FitNet, CRD, VID, FT, etc.).
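
The "perturbation within a ball constraint" can be made concrete with a small sketch. The abstract does not specify the norm or the attack, so the L-infinity ball, the signed-gradient step and the gradient values below are assumptions for illustration; in practice the gradient would come from the teacher's loss.

```python
def sign(g):
    """Return -1, 0 or 1 according to the sign of g."""
    return (g > 0) - (g < 0)

def signed_step(x, grad, alpha):
    """Move each coordinate by alpha in the direction of the gradient sign."""
    return [x_i + alpha * sign(g_i) for x_i, g_i in zip(x, grad)]

def project_linf(x_adv, x_clean, eps):
    """Clip x_adv coordinate-wise so it stays within eps of x_clean."""
    return [min(max(a, c - eps), c + eps) for a, c in zip(x_adv, x_clean)]

# Hypothetical clean input and placeholder gradient of the teacher loss.
x_clean = [0.5, 0.5, 0.5]
grad = [2.0, -0.3, 0.0]

stepped = signed_step(x_clean, grad, alpha=0.25)
x_adv = project_linf(stepped, x_clean, eps=0.125)
print(stepped, x_adv)
```

Here the step of size 0.25 overshoots the ball of radius 0.125, and the projection pulls the example back so that every coordinate differs from the clean input by at most eps.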

  • Conference Article
  • Citations: 3
  • 10.1109/yac57282.2022.10023833
Spatial-temporal consistency knowledge distillation for real-time semantic segmentation
  • Nov 19, 2022
  • Dongli Wang + 3 more

Real-time semantic segmentation is a key research topic in artificial intelligence applications such as autonomous driving and intelligent robots. At present, real-time semantic segmentation models still consume large amounts of storage space and computing resources. As an efficient model compression method, knowledge distillation is widely used in various fields of computer vision. In this paper, we propose a novel knowledge distillation framework based on a generative adversarial network structure, which combines spatial consistency and temporal consistency. The teacher network in this framework jointly uses a CNN branch and a transformer branch to improve the spatial consistency of the lightweight real-time semantic segmentation student network. In addition, we integrate the inter-frame relationship obtained by the optical flow network and the semantic segmentation network over continuous time as the temporal consistency constraint of the student network. Finally, spatial consistency and temporal consistency are coupled as spatial-temporal consistency knowledge. The main purpose of our knowledge distillation method is to transfer the spatial-temporal consistency knowledge contained in the teacher to the student. The student network obtained by knowledge distillation can process each frame independently in the inference stage, and our knowledge distillation method does not participate in the inference process of the student network, so it does not increase the computational cost of inference, yet it can narrow the performance gap in real-time semantic segmentation between the large model and the compact model. Using our method, we obtain a high-performance and efficient lightweight model. We verify the effectiveness of the proposed method on the CamVid and Cityscapes datasets.

  • Research Article
  • Citations: 4
  • 10.1038/s41598-024-64041-4
Multistage feature fusion knowledge distillation
  • Jun 11, 2024
  • Scientific Reports
  • Gang Li + 5 more

The recognition performance of lightweight models is generally lower than that of large models. Knowledge distillation, by teaching a student model using a teacher model, can further enhance the recognition accuracy of lightweight models. In this paper, we approach knowledge distillation from the perspective of intermediate feature-level knowledge distillation. We combine a cross-stage feature fusion symmetric framework, an attention mechanism to enhance the fused features, and a contrastive loss function for teacher and student models at the same stage to comprehensively implement a multistage feature fusion knowledge distillation method. This approach addresses the problem of significant differences in the intermediate feature distributions between teacher and student models, which make it difficult to learn implicit knowledge effectively, and thus improves the recognition accuracy of the student model. Compared to existing knowledge distillation methods, our method performs at a superior level. On the CIFAR100 dataset, it boosts the recognition accuracy of ResNet20 from 69.06% to 71.34%, and on the TinyImagenet dataset, it increases the recognition accuracy of ResNet18 from 66.54% to 68.03%, demonstrating the effectiveness and generalizability of our approach. Furthermore, there is room for further optimization of the overall distillation structure and feature extraction methods in this approach, which requires further research and exploration.

  • Book Chapter
  • Citations: 9
  • 10.1007/978-3-031-26284-5_31
What Role Does Data Augmentation Play in Knowledge Distillation?
  • Jan 1, 2023
  • Wei Li + 5 more

Knowledge distillation is an effective way to transfer knowledge from a large model to a small model, which can significantly improve the performance of the small model. In recent years, some contrastive learning-based knowledge distillation methods (i.e., SSKD and HSAKD) have achieved excellent performance by utilizing data augmentation. However, the value of data augmentation has been overlooked by researchers in knowledge distillation, and no work analyzes its role in detail. To fill this gap, we analyze the effect of data augmentation on knowledge distillation from multiple perspectives. In particular, we demonstrate the following properties of data augmentation: (a) data augmentation can effectively help knowledge distillation work even if the teacher model does not have the information about augmented samples, and our proposed diverse and rich Joint Data Augmentation (JDA) is more effective than rotation alone in knowledge distillation; (b) using diverse and rich augmented samples to assist the teacher model in training can improve its performance, but not the performance of the student model; (c) the student model can achieve excellent performance when the proportion of augmented samples is within a suitable range; (d) data augmentation enables knowledge distillation to work better in a few-shot scenario; (e) data augmentation is seamlessly compatible with some knowledge distillation methods and can potentially further improve their performance. Informed by the above analysis, we propose a method named Cosine Confidence Distillation (CCD) to transfer the augmented samples’ knowledge more reasonably. CCD achieves better performance than the latest SOTA HSAKD with fewer storage requirements on CIFAR-100 and ImageNet-1k. Our code is released at https://github.com/liwei-group/CCD.

  • Conference Article
  • Citations: 15
  • 10.1109/ijcnn48605.2020.9207148
Triplet Loss for Knowledge Distillation
  • Jul 1, 2020
  • Hideki Oki + 3 more

In recent years, deep learning has spread rapidly, and deeper, larger models have been proposed. However, the computational cost becomes enormous as models grow larger. Various techniques for compressing models have been proposed to improve performance while reducing computational costs. One such method is knowledge distillation (KD), a technique for transferring the knowledge of deep or ensemble models with many parameters (teacher model) to smaller, shallower models (student model). Since the purpose of knowledge distillation is to increase the similarity between the teacher model and the student model, we propose to introduce the concept of metric learning into knowledge distillation to bring the student model closer to the teacher model using pairs or triplets of training samples. Metric learning research develops methods for building models that increase the similarity of outputs for similar samples; it aims to reduce the distance between similar samples and increase the distance between dissimilar ones. The ability of metric learning to reduce the differences between similar outputs can be used in knowledge distillation to reduce the differences between the outputs of the teacher model and the student model. Since the outputs of the teacher model for different objects are usually different, the student model needs to distinguish them. We expect that metric learning can clarify the differences between different outputs and thereby improve the performance of the student model. We have performed experiments to compare the proposed method with state-of-the-art knowledge distillation methods. The results show that the student model obtained by the proposed method achieves higher performance than conventional knowledge distillation methods.
