Abstract

Model lightweighting aims to address the slow training and high resource requirements of large models, and knowledge distillation offers an effective solution. We built a lightweight model that meets the competition requirements and delivers strong NLP capabilities, using RoBERTa-tiny-clue as our backbone model. We evaluated the effect of soft labels and hard labels on knowledge distillation, performed the distillation, and fine-tuned the resulting model to obtain a lighter model with better performance, which we then applied to downstream NLP tasks. We also adopted a series of data augmentation methods to improve performance on downstream tasks and customized optimization strategies for each of the four tasks. Based on the open-source pre-trained model RoBERTa-tiny-clue and publicly available datasets, our model is 15 times smaller and 10 times faster than BERT-base while retaining 95% of BERT-base's performance on downstream NLP tasks. With suitable data augmentation, the lightweight model's performance on various downstream tasks reaches or exceeds that of BERT-base.
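As a rough illustration of the soft-label versus hard-label trade-off mentioned in the abstract, the PyTorch sketch below shows one common way to combine the two objectives in knowledge distillation. This is not the authors' code; the temperature and weighting values are assumptions chosen for illustration only.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Combine a soft-label and a hard-label objective for distillation.

    student_logits, teacher_logits: (batch, num_classes) raw logits
    labels: (batch,) gold class indices (hard labels)
    temperature: softens both distributions so the student learns from
                 the teacher's relative class probabilities (assumed value)
    alpha: weight on the soft-label term; (1 - alpha) on the hard labels
    """
    # Soft-label term: KL divergence between temperature-scaled distributions.
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # Hard-label term: ordinary cross-entropy against the gold labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    return alpha * soft_loss + (1.0 - alpha) * hard_loss


# Example usage with random tensors standing in for teacher/student outputs.
if __name__ == "__main__":
    batch, num_classes = 8, 3
    student_logits = torch.randn(batch, num_classes, requires_grad=True)
    teacher_logits = torch.randn(batch, num_classes)
    labels = torch.randint(0, num_classes, (batch,))
    loss = distillation_loss(student_logits, teacher_logits, labels)
    loss.backward()
    print(loss.item())
```

Setting alpha to 1.0 or 0.0 recovers a purely soft-label or purely hard-label setup, which is one way to compare the two signals as the abstract describes.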

