Knowledge Distillation as Efficient Pre-training: Faster Convergence, Higher Data-efficiency, and Better Transferability

Ruifei He, Shuyang Sun, Xiaojuan Qi, Song Bai, Jihan Yang

doi:10.1109/cvpr52688.2022.00895

Abstract

Large-scale pre-training has been proven to be crucial for various computer vision tasks. However, as the amount of pre-training data and the number of model architectures grow, and as some data remain private or inaccessible, it is neither efficient nor always possible to pre-train every model architecture on large-scale datasets. In this work, we investigate an alternative strategy for pre-training, namely Knowledge Distillation as Efficient Pre-training (KDEP), which aims to efficiently transfer the learned feature representation from existing pre-trained models to new student models for future downstream tasks. We observe that existing Knowledge Distillation (KD) methods are unsuitable for pre-training, since they normally distill the logits, which are discarded when the model is transferred to downstream tasks. To resolve this problem, we propose a feature-based KD method with non-parametric feature dimension aligning. Notably, our method performs comparably with supervised pre-training counterparts in 3 downstream tasks and 9 downstream datasets while requiring 10× less data and 5× less pre-training time. Code is available at https://github.com/CVMI-Lab/KDEP.
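The idea of distilling features rather than logits, with a parameter-free step to match feature dimensions, can be illustrated with a small sketch. The abstract does not say which alignment operator the authors use, so the group-averaging below is an assumption chosen only for simplicity, and the function names `align_by_group_mean` and `feature_distillation_loss` are hypothetical and not taken from the KDEP code base. The sketch also assumes the teacher dimension is a multiple of the student dimension and uses a plain mean-squared error as the distillation loss.

```python
from typing import List


def align_by_group_mean(teacher_feat: List[float], student_dim: int) -> List[float]:
    """Shrink a teacher feature vector to student_dim by averaging channel groups.

    Assumed stand-in for a non-parametric alignment: it has no learnable
    weights. Requires len(teacher_feat) to be a multiple of student_dim.
    """
    if student_dim <= 0 or len(teacher_feat) % student_dim != 0:
        raise ValueError("teacher dimension must be a positive multiple of student_dim")
    group = len(teacher_feat) // student_dim
    return [
        sum(teacher_feat[i * group:(i + 1) * group]) / group
        for i in range(student_dim)
    ]


def feature_distillation_loss(student_feat: List[float], teacher_feat: List[float]) -> float:
    """Mean-squared error between the student feature and the aligned teacher feature."""
    aligned = align_by_group_mean(teacher_feat, len(student_feat))
    return sum((s - t) ** 2 for s, t in zip(student_feat, aligned)) / len(student_feat)


# Toy example: a 4-d teacher feature distilled into a 2-d student feature.
teacher = [1.0, 3.0, 2.0, 6.0]   # aligned to [2.0, 4.0]
student = [1.0, 4.0]
print(feature_distillation_loss(student, teacher))  # 0.5
```

Because the loss is computed on features, nothing in it depends on the teacher's classification head, which is the part that would be thrown away when moving to a downstream task.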
