Abstract
Text classification can help people discover valuable information hidden in text collections. Many previous studies have achieved excellent results on traditional text classification tasks. With the rise of new social media, large numbers of short texts now appear on the Internet. Because short texts are sparse, many classification algorithms that perform well on long texts struggle to achieve satisfactory results on short texts. It is therefore important to find a method that computes effective vector representations of words and overcomes the feature-sparseness problem. Accordingly, we approach the problem from two directions: improving the quality of word vector representations and enhancing classification performance. This paper proposes a systematic framework for improving short text classification. In our framework, we first build a topic model with Latent Dirichlet Allocation (LDA) on a universal dataset drawn from Wikipedia and use this model to perform topic inference on short texts. We then propose an improved scheme of topical word embedding (TWE) to learn vector representations of both words and topics, which uses the word in the current word-topic pair to predict the contextual words, and the topic in the same pair to predict its contextual topics. In addition, the supervised Multi-Cluster Feature Selection (MCFS) algorithm is employed for topic selection, and we propose a topic merging strategy based on MCFS. After topic selection and merging, short text matrices are generated from the vector representations of words and topics and fed into a convolutional neural network (CNN). On an open short text classification dataset, we compare the proposed framework with various baselines; the experimental results demonstrate the effectiveness of our method.
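The first stage of the framework, training an LDA topic model on a universal corpus and then inferring topic distributions for unseen short texts, can be sketched as follows. This is a minimal illustration using scikit-learn's `LatentDirichletAllocation` as a stand-in for the paper's Wikipedia-trained model; the toy corpus, topic count, and variable names are assumptions for the example, not the paper's actual setup.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy "universal" corpus standing in for the Wikipedia dataset.
universal_corpus = [
    "machine learning models classify text documents",
    "neural networks learn vector representations of words",
    "football players score goals in the match",
    "the team won the championship game",
]

# Bag-of-words counts for LDA.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(universal_corpus)

# Fit the topic model on the universal corpus.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Topic inference on unseen short texts: each row of `theta`
# is the inferred topic distribution for one short text.
short_texts = ["word vectors for text classification"]
theta = lda.transform(vectorizer.transform(short_texts))
print(theta.shape)  # one topic distribution per short text
```

Because the topic model is trained on a large external corpus rather than on the sparse short texts themselves, the inferred distributions provide the extra context that the later embedding and classification stages rely on.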