Class imbalance is widespread in text sentiment classification and poses a serious challenge to designing an effective classifier. In this paper, we propose a two-stage data balancing scheme for text sentiment classification that both clarifies the data boundary and balances the class distribution of the training set. The first stage uses the scheme's core algorithm, LDMRC, which is based on the shortest distance from a point to a straight line: it removes some majority-class texts in the neighborhood of a minority-class text, balancing the class distribution within local dense mixed regions. The second stage employs SS or RS as a rebalancing strategy to globally balance the training set after local dense mixed region cutting. The scheme is applied as a preprocessing step in front of a learning algorithm such as SVM. Using SVM, we verify the effectiveness of the proposed method on eight imbalanced datasets: Book_c, Hotel, Jadeite, and Insurance in Chinese, and DVD, Book_e, Electronics, and Kitchen in English. The experimental results show that LDMRC outperforms BRC, the best existing cutting algorithm, in terms of Acc, RN, and FN. Furthermore, LDMRC+SS and LDMRC+RS outperform LDMRC alone on the Chinese datasets. This indicates that local boundary cutting alone does not achieve the best results, and that data rebalancing strategies are necessary for text sentiment classification.
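The two-stage idea can be illustrated with a minimal sketch. It is not the LDMRC algorithm: the abstract does not give its point-to-line criterion, so stage 1 here is a simplified stand-in that drops majority-class points within a fixed Euclidean radius of any minority-class point. The abstract also does not define SS or RS, so stage 2 uses plain random undersampling as an assumed placeholder. The function names, the `radius` parameter, and the toy 2-D vectors are all hypothetical.

```python
import math
import random


def cut_local_mixed_region(points, labels, minority, radius):
    """Stage 1 (simplified stand-in, NOT LDMRC): drop every majority-class
    point lying within `radius` of at least one minority-class point."""
    minority_pts = [p for p, y in zip(points, labels) if y == minority]
    kept = []
    for p, y in zip(points, labels):
        near_minority = any(math.dist(p, m) <= radius for m in minority_pts)
        if y != minority and near_minority:
            continue  # majority point inside a mixed neighborhood: remove it
        kept.append((p, y))
    return kept


def random_undersample(samples, minority, seed=0):
    """Stage 2 (assumed placeholder for SS/RS): randomly drop majority
    samples until both classes have the same number of samples."""
    rng = random.Random(seed)
    minority_samples = [s for s in samples if s[1] == minority]
    majority_samples = [s for s in samples if s[1] != minority]
    k = min(len(minority_samples), len(majority_samples))
    return minority_samples + rng.sample(majority_samples, k)


# Toy feature vectors: label 1 is the minority class, label 0 the majority.
points = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1),
          (3.0, 3.0), (4.0, 4.0), (5.0, 5.0), (6.0, 6.0)]
labels = [1, 0, 0, 0, 0, 0, 0]

after_cut = cut_local_mixed_region(points, labels, minority=1, radius=0.5)
balanced = random_undersample(after_cut, minority=1)
```

On this toy data, stage 1 removes the two majority points next to the minority point, and stage 2 then reduces the remaining majority samples to match the minority count. The balanced set would then be passed to a classifier such as SVM.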