Pre-training Fine-tuning Data Enhancement Method Based on Active Learning

Deqi Cao, Haoyang Ma, Fei Wang, Zhaoyun Ding

doi:10.1109/trustcom56396.2022.00205

Abstract

With the development of Internet technology, the number of Internet users has grown rapidly, and a very large amount of data is generated on the Internet every day. At the same time, advances in storage and query technology have made it easy to collect massive data, but the information value of these data is uneven, and most of them are unlabeled. Traditional supervised learning, however, requires a large number of labeled samples. For large pools of unlabeled samples, effective automatic labeling methods are lacking, and manual labeling is costly. If samples are chosen for annotation by simple random sampling, noisy samples may be selected and resources wasted, and low-quality training data can also reduce the prediction accuracy of the model. Meanwhile, traditional deep learning methods train poorly on small labeled training sets. This paper takes text sentiment analysis, a natural language processing task, as its setting and uses IMDB movie review data as the training and test sets. Starting from the design of an active learning algorithm based on clustering analysis, and combining it with an appropriate pre-training fine-tuning model, it constructs a data enhancement method based on active learning. Experiments show that when the labeled training set is reduced by 90%, the prediction accuracy of the pre-training model drops by no more than 2%, which verifies the effectiveness of the data enhancement method that combines active learning with the pre-training model.
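The abstract does not specify which clustering algorithm or selection rule the method uses, so the following is only a minimal sketch of one common form of clustering-based active learning, under stated assumptions: k-means is assumed as the clustering step, and the sample closest to each cluster centroid is assumed to be the one sent for manual labeling. The function names (`kmeans`, `select_for_labeling`) and the toy two-dimensional feature vectors are hypothetical; in practice the vectors would be text representations of the unlabeled reviews.

```python
import random
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Vector], k: int, iterations: int = 20,
           seed: int = 0) -> List[List[float]]:
    """Plain k-means (an assumed choice of clustering algorithm)."""
    rng = random.Random(seed)
    # Initialise centroids from k distinct samples.
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        clusters: List[List[Vector]] = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: squared_distance(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                dim = len(members[0])
                centroids[c] = [sum(m[d] for m in members) / len(members)
                                for d in range(dim)]
    return centroids


def select_for_labeling(points: List[Vector], k: int,
                        seed: int = 0) -> List[int]:
    """Return the indices of k samples to label: for each centroid, the
    nearest sample that has not already been chosen (assumed rule)."""
    centroids = kmeans(points, k, seed=seed)
    chosen: List[int] = []
    for centroid in centroids:
        candidates = [i for i in range(len(points)) if i not in chosen]
        best = min(candidates,
                   key=lambda i: squared_distance(points[i], centroid))
        chosen.append(best)
    return sorted(chosen)


# Toy unlabeled pool: two well-separated groups of feature vectors.
pool = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
to_label = select_for_labeling(pool, k=2)
```

Under these assumptions, only the samples indexed by `to_label` would be annotated by hand and then used to fine-tune the pre-trained model, so the labeling budget is spent on one representative sample per cluster instead of on a random draw.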

Full Text