Abstract
Recent advances in pretrained language models have achieved state-of-the-art results on a variety of natural language processing tasks. However, these large pretrained language models are difficult to deploy in practical settings such as mobile and embedded devices, and no pretrained language model exists for the chemical industry. In this work, we propose a method for pretraining a smaller language representation model for the chemical industry domain. First, a large corpus of chemical industry texts is used for pretraining, and a non-traditional knowledge distillation technique is used to build a simplified model that learns the knowledge contained in the BERT model. By learning the embedding layer, the intermediate layers, and the prediction layer at different stages, the simplified model learns not only the probability distribution of the prediction layer but also the embedding layer and the intermediate layers, thereby acquiring the representational ability of the BERT model. Finally, the simplified model is applied to downstream tasks. Experiments show that, compared with current BERT distillation methods, our method makes full use of the rich feature knowledge in the intermediate layers of the teacher model while building the student model on a BiLSTM architecture. This effectively addresses the excessive size of traditional Transformer-based student models and improves the accuracy of the language model in the chemical domain.
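As a rough illustration of the staged objective described above, the sketch below (PyTorch assumed; all function and tensor names are illustrative and not taken from the paper) combines an embedding-layer loss, an intermediate-layer loss, and a prediction-layer loss. The paper applies these at different stages of training; for brevity the sketch simply sums them.

```python
import torch.nn.functional as F

def distillation_loss(student_emb, teacher_emb,
                      student_hidden, teacher_hidden,
                      student_logits, teacher_logits,
                      temperature=2.0):
    # Embedding-layer loss: student embeddings (already projected to the
    # teacher's hidden size) are matched to the teacher embeddings with MSE.
    loss_emb = F.mse_loss(student_emb, teacher_emb)

    # Intermediate-layer loss: match the (projected) student states to the
    # hidden states of a selected teacher layer.
    loss_mid = F.mse_loss(student_hidden, teacher_hidden)

    # Prediction-layer loss: KL divergence between temperature-scaled
    # teacher and student output distributions.
    loss_pred = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    return loss_emb + loss_mid + loss_pred
```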
Highlights
Related Work
The specific method is to first train a bidirectional LSTM model on a large corpus with language modeling as the training objective and then use the LSTM to generate word representations.
Many studies have shown that domain-specific pretrained language models perform better on in-domain tasks.
(1) Traditional knowledge distillation methods for BERT models often fail to fully learn the representational capabilities of each layer of the teacher model; to learn them, student models based on the Transformer architecture are still needed, and such student models still have a huge number of parameters. Therefore, we proposed a multilayer BiLSTM architecture for the student model to fully learn the representational capabilities of the teacher model, which significantly reduces the number of student model parameters at the cost of only a small loss in performance compared with the former.
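A minimal sketch of what such a multilayer BiLSTM student might look like (PyTorch assumed; the layer sizes, the 768-dimensional teacher hidden size, and the first-token pooling are placeholder choices, not the paper's configuration):

```python
import torch.nn as nn

class BiLSTMStudent(nn.Module):
    """Illustrative multilayer BiLSTM student encoder with placeholder sizes."""
    def __init__(self, vocab_size, num_labels, emb_dim=256,
                 hidden_dim=384, num_layers=3, teacher_dim=768):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, num_layers=num_layers,
                               batch_first=True, bidirectional=True)
        # Projections so the student's embeddings and BiLSTM states can be
        # compared with the teacher's hidden size during distillation.
        self.emb_proj = nn.Linear(emb_dim, teacher_dim)
        self.hid_proj = nn.Linear(2 * hidden_dim, teacher_dim)
        self.classifier = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, input_ids):
        emb = self.embedding(input_ids)            # (batch, seq, emb_dim)
        hidden, _ = self.encoder(emb)              # (batch, seq, 2*hidden_dim)
        logits = self.classifier(hidden[:, 0, :])  # first token as sequence summary
        return self.emb_proj(emb), self.hid_proj(hidden), logits
```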
Summary
The specific method is to first train a bidirectional LSTM model on a large corpus with language modeling as the training objective and then use the LSTM to generate word representations. The fine-tuning method is to pretrain the language model on a large corpus with an unsupervised objective and then use labeled in-domain data to adapt the model for downstream applications. We studied the problem of compressing large-scale language models and proposed a training method and device for a Chemical Industry Chinese Language Model based on knowledge distillation, which effectively transfers the teacher's knowledge to the student model.
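Putting the pieces together, one distillation step might look like the following sketch. It assumes HuggingFace transformers and the `BiLSTMStudent` and `distillation_loss` sketches above; `bert-base-chinese` stands in for the actual teacher (which in the paper would be a BERT model adapted to chemical-industry text), so this is an illustration rather than the paper's exact procedure.

```python
import torch
from transformers import BertForSequenceClassification

# Teacher: a fine-tuned BERT classifier (checkpoint name is a placeholder).
teacher = BertForSequenceClassification.from_pretrained(
    "bert-base-chinese", num_labels=2, output_hidden_states=True)
teacher.eval()  # the teacher is frozen; only the student is updated

student = BiLSTMStudent(vocab_size=21128, num_labels=2)
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)

def distill_step(input_ids, attention_mask):
    with torch.no_grad():
        out = teacher(input_ids=input_ids, attention_mask=attention_mask)
        teacher_emb = out.hidden_states[0]      # embedding-layer output
        teacher_hidden = out.hidden_states[-1]  # last layer as a stand-in for a selected layer
        teacher_logits = out.logits

    student_emb, student_hidden, student_logits = student(input_ids)
    loss = distillation_loss(student_emb, teacher_emb,
                             student_hidden, teacher_hidden,
                             student_logits, teacher_logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

After distillation, the student can be fine-tuned on labeled in-domain data in the usual way before being deployed to downstream chemical-domain tasks.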