Abstract

Semi-supervised learning is a promising research direction in text classification. In this paper, semi-supervised pseudo-label training experiments are conducted using a pre-trained BERT model as the baseline. The original training set is split so that only 20% of it serves as the new labelled training set; the remaining 80%, with labels removed, forms the raw corpus used for pseudo-label training, and the original test set is retained for evaluation. The results indicate that the success of the semi-supervised pseudo-labelling method hinges on the performance of the initial model and on reasonable data-filtering techniques. Although the SoftMax confidence used for data filtering is not precisely equivalent to prediction accuracy, the experiments show that it can reduce error propagation to some extent, which is consistent with earlier research. However, using a SoftMax threshold for data screening does not provide enough benefit for the model to surpass the performance achieved by training on the full original dataset. Future studies will therefore focus on more suitable data selection methods that improve the accuracy of pseudo-labelling and thus the model's performance.
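
As a rough illustration of the confidence-based filtering step described above, the sketch below shows how pseudo-labels might be selected from a classifier's SoftMax outputs. The function name, threshold value, and example logits are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def select_pseudo_labels(logits: torch.Tensor, threshold: float = 0.95):
    """Keep only predictions whose SoftMax confidence exceeds `threshold`.

    logits: (batch_size, num_classes) raw outputs of the classifier head.
    Returns the indices of retained examples and their pseudo-labels.
    Note: the threshold of 0.95 is an assumed value for illustration.
    """
    probs = F.softmax(logits, dim=-1)            # convert logits to class probabilities
    confidence, pseudo_labels = probs.max(dim=-1)  # highest probability and its class
    keep = confidence >= threshold               # boolean mask of confident predictions
    return keep.nonzero(as_tuple=True)[0], pseudo_labels[keep]

# Example: hypothetical logits from a baseline classifier over four unlabelled sentences
logits = torch.tensor([[4.0, 0.1], [1.2, 1.1], [0.2, 3.5], [0.5, 0.4]])
idx, labels = select_pseudo_labels(logits, threshold=0.9)
print(idx, labels)  # only the high-confidence examples receive pseudo-labels
```

Only the retained examples would be added to the labelled pool for the next round of training; low-confidence predictions are discarded to limit error propagation.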
