Combination of Chinese sentiment analysis datasets based on BiLSTM+Attention model

Kaiyuan Jiang

doi:10.54254/2755-2721/36/20230432

Abstract

Training a Chinese sentiment analysis model shows that a model trained on one dataset has markedly lower prediction accuracy on other datasets. Because existing sentiment analysis work mainly uses single-domain corpora, and drawing on existing data processing methods in natural language processing, this paper designs an experiment that combines Chinese datasets from different fields into one large field-imbalanced dataset, in which the number of samples differs substantially from field to field. The new dataset is used to train a comprehensive Chinese sentiment analysis model, which achieves satisfactory training results. The experimental results show that the model trained on the field-imbalanced dataset has high prediction accuracy for samples from every field, and that the accuracy for a given field increases as that field's share of the training corpus grows. These experiments offer some ideas for constructing large-scale cross-domain Chinese sentiment analysis datasets in the future.

Full Text