Abstract

Large language models are a highly effective and promising technology that can improve the performance and robustness of natural language processing tasks. As a representative example, Bidirectional Encoder Representations from Transformers (BERT) achieves excellent performance on a wide range of natural language processing tasks. The model is pre-trained on large-scale text corpora and has attracted broad attention since its introduction. However, its enormous number of parameters and overall size severely limit its usability on mobile and other resource-constrained devices. Knowledge distillation, an effective technique for compressing neural networks, can produce a lightweight model with far fewer parameters without sacrificing much of the original model's performance. Consequently, a number of efforts have been made to distill BERT into lightweight variants. In this paper, we introduce several common BERT distillation models and analyse their model architectures, their distillation processes, and finally their compression efficiency and model effectiveness. We conclude that gains in compression efficiency are often accompanied by losses in model effectiveness.
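
To make the knowledge distillation idea concrete, the sketch below shows the generic soft-label distillation objective (in the style of Hinton et al.), in which a small student model is trained to match the temperature-softened output distribution of a large teacher while also fitting the true labels. This is an illustrative assumption, not the specific objective of any BERT distillation model surveyed in the paper; the function name, temperature `T`, and mixing weight `alpha` are hypothetical choices.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      T: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Weighted sum of a soft-target loss (teacher imitation) and a
    hard-target loss (ground-truth labels)."""
    # Soft targets: KL divergence between temperature-softened teacher
    # and student distributions, scaled by T^2 to keep gradient magnitudes
    # comparable across temperatures.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against the true labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```

Variants of the BERT distillation models discussed in this paper extend this basic recipe, for example by also matching intermediate hidden states or attention distributions between teacher and student.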
