Explanation Guided Knowledge Distillation for Pre-trained Language Model Compression

Zhao Yang, Yuanzhe Zhang, Dianbo Sui, Yiming Ju, Jun Zhao, Kang Liu

doi:10.1145/3639364

Abstract

Knowledge distillation is widely used to compress pre-trained language models by transferring knowledge from a cumbersome model to a lightweight one. Although distillation-based model compression has achieved promising performance, we observe that the explanations of the teacher model and the student model are not consistent. We argue that the student model should learn not only the predictions of the teacher model but also its internal reasoning process. To this end, we propose Explanation Guided Knowledge Distillation (EGKD) in this article, which uses explanations to represent the thinking process and improve knowledge distillation. To obtain explanations in our distillation framework, we select three typical explanation methods rooted in different mechanisms, namely gradient-based, perturbation-based, and feature selection methods. Then, to improve computational efficiency, we propose different optimization strategies to utilize the explanations obtained by these three explanation methods, which can provide the student model with better learning guidance. Experimental results on GLUE demonstrate that leveraging explanations can improve the performance of the student model. Moreover, EGKD can also be applied to model compression with different architectures.
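
The abstract does not state the training objective, so the following is only a minimal sketch of the general idea, under stated assumptions: a standard soft-label distillation term (KL divergence between temperature-softened teacher and student distributions) combined with a hypothetical explanation-consistency term (mean squared error between normalized per-token attribution scores). The function names, the weighting parameter `alpha`, the normalization, and the choice of squared error are all assumptions made for illustration; they are not taken from the article and do not reproduce its three explanation-specific optimization strategies.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - lse for z in scaled]


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    This is the common soft-label distillation term, scaled by T^2.
    """
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return temperature ** 2 * kl


def normalize(scores):
    """L1-normalize absolute attribution scores (an assumed convention)."""
    mags = [abs(s) for s in scores]
    total = sum(mags)
    return [m / total for m in mags] if total > 0 else [0.0] * len(mags)


def explanation_loss(student_attr, teacher_attr):
    """Hypothetical consistency term: MSE between normalized attributions."""
    s = normalize(student_attr)
    t = normalize(teacher_attr)
    return sum((a - b) ** 2 for a, b in zip(s, t)) / len(s)


def egkd_style_loss(student_logits, teacher_logits,
                    student_attr, teacher_attr, alpha=0.5):
    """Prediction-matching term plus weighted explanation-matching term.

    `alpha` is an assumed hyperparameter, not a value from the article.
    """
    return (kd_loss(student_logits, teacher_logits)
            + alpha * explanation_loss(student_attr, teacher_attr))


if __name__ == "__main__":
    # Toy values: two-class logits and attributions over three tokens.
    loss = egkd_style_loss([2.0, 1.0], [1.0, 2.0],
                           [0.1, 0.7, 0.2], [0.6, 0.3, 0.1])
    print(f"combined loss: {loss:.4f}")
```

In this sketch the attribution vectors are passed in as plain lists; in practice they would come from whichever explanation method is in use (for example, gradient magnitudes per input token). The combined loss is zero only when the student matches the teacher on both predictions and normalized attributions.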

Journal: ACM Transactions on Asian and Low-Resource Language Information Processing
Publication Date: Feb 8, 2024
Citations: 1

Similar Papers

Knowledge Distillation for Neural Transducers from Large Self-Supervised Pre-Trained Models
Xiaoyu Yang ... Philip C Woodland
23 May 2022

MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation
27 Jun 2022

MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation
Simiao Zuo ... Qingru Zhang
01 Jan 2021

Resolution-Aware Knowledge Distillation for Efficient Inference
Zhanxiang Feng ... Jianhuang Lai
IEEE Transactions on Image Processing | VOL. 30
01 Jan 2020
