Abstract

Knowledge distillation (KD) is one of the most effective neural network light-weighting techniques when training data is available. However, KD is seldom applicable in environments where access to the training data is difficult or impossible. To solve this problem, complete zero-shot KD (C-ZSKD) based on adversarial learning has recently been proposed, but the so-called biased sample generation problem limits its performance. To overcome this limitation, this paper proposes a novel C-ZSKD algorithm that utilizes a label-free adversarial perturbation. The proposed adversarial perturbation derives a squared-gradient-norm-style constraint by using the convolution of probability distributions and a second-order Taylor series approximation. The constraint serves to increase the variance of the adversarial sample distribution, which allows the student model to learn the decision boundary of the teacher model more accurately without labeled data. By analyzing the distribution of adversarial samples in the embedding space, this paper also provides insight into the characteristics of adversarial samples that are effective for adversarial learning-based C-ZSKD.
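As a rough illustration only (not the paper's exact formulation), the following Python sketch shows one way a label-free adversarial perturbation with an added squared gradient-norm term could be realized. The teacher/student interfaces, the KL-based disagreement measure, the step size epsilon, and the weight lam are assumptions introduced for illustration.

    import torch
    import torch.nn.functional as F

    def label_free_adversarial_perturbation(x, teacher, student,
                                            epsilon=0.03, lam=0.1):
        # Illustrative sketch, not the paper's exact objective: perturb x to
        # maximize teacher-student disagreement (no ground-truth labels needed),
        # with an assumed squared gradient-norm term that pushes the adversarial
        # samples to spread out (i.e., increases their variance) around x.
        x_adv = x.clone().detach().requires_grad_(True)

        t_logits = teacher(x_adv)
        s_logits = student(x_adv)
        disagreement = F.kl_div(F.log_softmax(s_logits, dim=1),
                                F.softmax(t_logits, dim=1),
                                reduction='batchmean')

        # Gradient of the disagreement w.r.t. the input, kept in the graph so
        # the squared-norm term itself can be differentiated.
        grad = torch.autograd.grad(disagreement, x_adv, create_graph=True)[0]
        grad_norm_sq = grad.flatten(1).pow(2).sum(dim=1).mean()

        objective = disagreement + lam * grad_norm_sq

        # Single FGSM-style ascent step on the combined objective.
        step = torch.autograd.grad(objective, x_adv)[0]
        return (x_adv + epsilon * step.sign()).detach()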

Highlights

  • With the advent of effective solutions [1], [2] to the gradient vanishing problem, deep neural networks that provide high recognition performance have been developed rapidly

  • To solve the inherent biased sample generation problem of adversarial learning (AL)-based complete zero-shot KD (C-ZSKD), we propose a method to increase the variance of the adversarial sample distribution by using the convolution of probability distributions and a Taylor series approximation

  • By analyzing the distribution of adversarial samples in the embedding space, this paper provides an insight into the characteristics of adversarial samples that are useful for AL-based C-ZSKD


Summary

Introduction

With the advent of effective solutions [1], [2] to the gradient vanishing problem, deep neural networks that provide high recognition performance have been developed rapidly. Hinton et al. first introduced the concept of knowledge distillation (KD) to effectively lighten neural networks [3]. KD is a technique that transfers knowledge from a large network performing a similar task to a relatively small network. KD allows the small network to overcome the limitations of training on its own and to approach the performance of the large network. Conventional KD techniques implicitly assume that training data is always available.

A. COMPLETE ZERO-SHOT KNOWLEDGE DISTILLATION

Unlike G-ZSKD, C-ZSKD is a highly scalable method because it operates even in environments where access to training data is completely blocked. Assuming that the label of each class follows a Dirichlet distribution D, the concentration parameter of D was derived from the weights W. Pseudo labels y were sampled from D, and the corresponding pseudo image x∗ was generated according to Eq. (1).
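Since Eq. (1) itself is not reproduced above, the following Python sketch illustrates the Dirichlet-based pseudo-sample generation this paragraph describes: concentration parameters are derived from the final-layer weights W, a pseudo soft label y is sampled from the resulting Dirichlet distribution, and a pseudo image x∗ is optimized so that the teacher's softened prediction matches y. The similarity-based form of the concentration, the temperature tau, and the optimizer settings are illustrative assumptions rather than the paper's exact choices.

    import torch
    import torch.nn.functional as F

    def generate_pseudo_sample(teacher, W, image_shape=(1, 3, 32, 32),
                               beta=1.0, tau=20.0, steps=1500, lr=0.01):
        # Hedged sketch of Dirichlet-based pseudo-sample generation.
        # W: final-layer weight matrix of the teacher (num_classes x feature_dim).

        # Assumed form of the concentration parameter: a softmax over cosine
        # similarities between the class weight vectors, scaled by beta.
        W_norm = F.normalize(W, dim=1)
        similarity = W_norm @ W_norm.t()                    # (C, C) class similarity
        k = torch.randint(similarity.size(0), (1,)).item()  # pick a target class
        alpha = F.softmax(beta * similarity[k], dim=0)      # concentration of D

        # Sample a pseudo soft label y from the Dirichlet distribution D.
        y = torch.distributions.Dirichlet(alpha).sample()

        # Optimize a pseudo image x* so the teacher's softened output matches y
        # (soft-label cross-entropy; requires PyTorch >= 1.10).
        x = torch.randn(image_shape, requires_grad=True)
        opt = torch.optim.Adam([x], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            loss = F.cross_entropy(teacher(x) / tau, y.unsqueeze(0))
            loss.backward()
            opt.step()
        return x.detach(), y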
