Abstract

Short text classification is a crucial task in information retrieval, social media text categorization, and many other applications. However, the inherent sparsity and limited information in short texts make learning from and classifying them a significant challenge. In this paper, we propose a new framework, WEFEST, which expands short texts using word embeddings for classification. WEFEST is rooted in a deep language model that learns a new word embedding space from word correlations, so that semantically related words have close feature vectors in the new space. By using word embedding features to expand the short texts, WEFEST enriches the word density of the short texts for effective learning, following three major steps. First, each short text in the training dataset is enriched using pre-trained word embeddings. Then, the semantic similarity between two short texts is computed using statistical frequency information retrieved from the trained model. Finally, the nearest neighbor algorithm is applied to classify the short texts. Experimental results on a Chinese news title dataset validate the effectiveness of the proposed method.
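The expand-then-classify pipeline described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual model: the toy embedding table, the mean-vector text representation, and cosine similarity are assumptions standing in for WEFEST's learned embeddings and frequency-based similarity.

```python
import math

# Hypothetical pre-trained word embeddings (toy values for illustration;
# WEFEST learns its embedding space with a deep language model).
EMB = {
    "stock":  [0.90, 0.10, 0.00],
    "market": [0.80, 0.20, 0.10],
    "shares": [0.85, 0.15, 0.05],
    "match":  [0.10, 0.90, 0.00],
    "goal":   [0.00, 0.95, 0.10],
    "team":   [0.05, 0.85, 0.20],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def expand(words, k=1):
    """Step 1 (sketched): enrich a short text by appending each
    word's k nearest neighbors in the embedding space."""
    out = list(words)
    for w in words:
        if w not in EMB:
            continue
        neighbors = sorted(
            (v for v in EMB if v != w),
            key=lambda v: cosine(EMB[w], EMB[v]),
            reverse=True,
        )
        out.extend(neighbors[:k])
    return out

def text_vector(words):
    """Represent an (expanded) text as the mean of its word vectors."""
    vecs = [EMB[w] for w in words if w in EMB]
    if not vecs:
        return [0.0, 0.0, 0.0]
    return [sum(c) / len(vecs) for c in zip(*vecs)]

def knn_classify(query, train, k=1):
    """Steps 2-3 (sketched): score similarity between expanded texts,
    then classify by nearest-neighbor majority vote."""
    q = text_vector(expand(query))
    scored = sorted(
        ((cosine(q, text_vector(expand(w))), label) for w, label in train),
        reverse=True,
    )
    top = [label for _, label in scored[:k]]
    return max(set(top), key=top.count)

train = [(["stock", "market"], "finance"), (["match", "goal"], "sports")]
print(knn_classify(["shares"], train))  # -> finance
print(knn_classify(["team"], train))    # -> sports
```

With expansion, a one-word query like `["shares"]` is padded with embedding neighbors before comparison, which mitigates the sparsity that makes raw short-text matching unreliable.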
