A Semi-Supervised Short Text Classification Method Based on Weighted Word Vector Representation

Zhiming Zhang, Geyu Huang, Jie Luo

doi:10.1109/iceiec.2019.8784604

Abstract

Short texts are generated dynamically and in large volumes, carry little semantic information, often lack class labels, and yield high-dimensional, sparse features. Semi-supervised learning has therefore become an important research direction for short text classification. To address the defects of existing short text classification algorithms, a semi-supervised short text classification algorithm based on weighted word vector representation is proposed. First, a strong category feature set representing category information is extracted from the labeled short text dataset using an improved expected cross entropy. Second, feature vectors of the short texts are obtained by Word2Vec training, and the mean feature vector of the short texts that contain strong feature words is used as a virtual class center; the cosine similarity between every short text and the virtual class center is then computed and normalized. The normalized similarities serve as per-text weights, from which the real class center of the short texts is calculated. Finally, unlabeled data are classified according to the similarity between their cluster centers and the real class centers of the labeled data. Experimental results show that, compared with classical semi-supervised classification algorithms, the proposed algorithm achieves higher classification accuracy through simple weighting.
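The weighting step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`real_class_center`, `classify`), the toy two-dimensional vectors, the clipping of negative similarities to zero, and the choice to normalize similarities by their sum are all assumptions made for illustration. In practice the vectors would come from a trained Word2Vec model, and the strong-feature flags from the expected cross entropy step.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def real_class_center(vectors, has_strong_feature):
    """Weighted class center for one labeled class (illustrative only).

    vectors: one feature vector per labeled short text in the class.
    has_strong_feature: parallel list of booleans marking texts that
    contain at least one strong category feature word.
    """
    # Virtual center: mean of the texts containing strong feature words.
    # Assumption: fall back to all texts if none are flagged.
    strong = [v for v, flag in zip(vectors, has_strong_feature) if flag]
    virtual = mean_vector(strong or vectors)

    # Similarity of every text to the virtual center, clipped at zero
    # (assumption) so that the weights are non-negative.
    sims = [max(cosine(v, virtual), 0.0) for v in vectors]

    # Assumed normalization: divide by the sum so the weights add up to 1.
    total = sum(sims)
    if total:
        weights = [s / total for s in sims]
    else:
        weights = [1.0 / len(vectors)] * len(vectors)

    # Real center: similarity-weighted average of all texts in the class.
    return [sum(w * x for w, x in zip(weights, col)) for col in zip(*vectors)]

def classify(cluster_center, class_centers):
    """Return the label whose real class center is most similar."""
    return max(class_centers,
               key=lambda label: cosine(cluster_center, class_centers[label]))

# Toy data: the third "sports" text is an outlier without strong features.
sports = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]
finance = [[0.0, 1.0], [0.1, 0.9]]
centers = {
    "sports": real_class_center(sports, [True, True, False]),
    "finance": real_class_center(finance, [True, True]),
}
print(centers)
print(classify([0.9, 0.2], centers))
```

In this toy example the outlier text receives a small weight, so the weighted "sports" center stays closer to the two texts that contain strong feature words than a plain mean would.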

Full Text