Abstract

The success of pre-trained language representation models such as BERT stems largely from their "over-parameterized" nature, which leads to long training times, high computational complexity, and demanding hardware requirements. Among the variety of model compression and acceleration techniques, Knowledge Distillation (KD) has attracted extensive attention for compressing pre-trained language models. However, KD faces two major challenges: (i) transferring as much knowledge as possible from the teacher model to the student model without sacrificing accuracy while accelerating inference, and (ii) the faster training of a lightweight model comes with a risk of overfitting caused by noise. To address these problems, we propose a novel model based on knowledge distillation, called Theseus-BERT Guided Distill CNN (TBG-disCNN). BERT-of-Theseus [1] is employed as the teacher model and a CNN as the student model. To counter the inherent noise problem, we propose a coordinated CNN-BiLSTM as the parameter-sharing layer for Multi-Task Learning (MTL), capturing both regional and long-term dependency information. Our approach achieves performance approximately on par with BERT-base and the teacher model, with $12 \times$ and $281 \times$ inference speedups and $19.58 \times$ and $8.94 \times$ fewer parameters, respectively.
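To make the shared-layer idea concrete, below is a minimal sketch of what a coordinated CNN-BiLSTM parameter-sharing encoder with task-specific heads could look like. All class names, layer sizes, pooling choices, and the concatenation of the two branches are illustrative assumptions (written in PyTorch), not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class SharedCNNBiLSTM(nn.Module):
    """Hypothetical shared encoder: a CNN branch for regional (local n-gram)
    features and a BiLSTM branch for long-term dependencies."""

    def __init__(self, vocab_size=30000, emb_dim=128,
                 n_filters=128, kernel_sizes=(3, 4, 5), hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Parallel 1-D convolutions over the token embeddings capture local patterns.
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_filters, k, padding=k // 2) for k in kernel_sizes
        )
        # A bidirectional LSTM over the same embeddings captures long-range context.
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.out_dim = n_filters * len(kernel_sizes) + 2 * hidden

    def forward(self, token_ids):
        x = self.embed(token_ids)                                  # (B, T, E)
        conv_in = x.transpose(1, 2)                                # (B, E, T)
        conv_feats = [torch.relu(c(conv_in)).max(dim=2).values     # max-pool each filter map
                      for c in self.convs]
        lstm_out, _ = self.bilstm(x)                               # (B, T, 2H)
        lstm_feat = lstm_out.max(dim=1).values                     # pooled long-range feature
        # Concatenate regional and long-term features into one shared representation.
        return torch.cat(conv_feats + [lstm_feat], dim=-1)


class MultiTaskStudent(nn.Module):
    """Illustrative MTL setup: one shared trunk, one linear head per task."""

    def __init__(self, task_num_classes):
        super().__init__()
        self.encoder = SharedCNNBiLSTM()
        self.heads = nn.ModuleDict({
            name: nn.Linear(self.encoder.out_dim, n_cls)
            for name, n_cls in task_num_classes.items()
        })

    def forward(self, token_ids, task):
        return self.heads[task](self.encoder(token_ids))


# Usage sketch: two hypothetical tasks sharing the CNN-BiLSTM parameters.
model = MultiTaskStudent({"sentiment": 2, "topic": 5})
logits = model(torch.randint(0, 30000, (4, 32)), task="sentiment")
```

The intent of the design, as described in the abstract, is that the CNN branch contributes regional features while the BiLSTM branch contributes long-term dependence information; sharing this trunk across tasks is one standard way to regularize a lightweight student against noise-driven overfitting.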
