Abstract

Large language models are a highly effective and promising technology that can improve the performance and robustness of natural language processing tasks. As a representative example, Bidirectional Encoder Representations from Transformers (BERT) achieves excellent performance on a wide range of natural language processing tasks. The model is pre-trained on large-scale text corpora and has attracted broad attention since its introduction. However, its enormous number of parameters and overall size severely limit its usability on mobile and other resource-constrained devices. Knowledge distillation, an effective technique for compressing neural networks, can produce a lightweight model with far fewer parameters without sacrificing much of the original model's performance. Consequently, a number of efforts have been made to distill BERT into lightweight variants. In this paper, we introduce several common BERT distillation models and analyse their model architectures, their distillation processes, and finally their compression efficiency and model effectiveness. We conclude that gains in compression efficiency are often accompanied by losses in model effectiveness.
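
To make the knowledge distillation idea concrete, the sketch below shows the generic soft-label distillation objective (in the style of Hinton et al.), in which a small student model is trained to match the temperature-softened output distribution of a large teacher while also fitting the true labels. This is an illustrative assumption, not the specific objective of any BERT distillation model surveyed in the paper; the function name, temperature `T`, and mixing weight `alpha` are hypothetical choices.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      T: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Weighted sum of a soft-target loss (teacher imitation) and a
    hard-target loss (ground-truth labels)."""
    # Soft targets: KL divergence between temperature-softened teacher
    # and student distributions, scaled by T^2 to keep gradient magnitudes
    # comparable across temperatures.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against the true labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```

Variants of the BERT distillation models discussed in this paper extend this basic recipe, for example by also matching intermediate hidden states or attention distributions between teacher and student.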
