Abstract
Text classification can help people discover valuable information hidden in text collections. Many previous studies have achieved excellent results on traditional text classification tasks. With the rise of new social media, large numbers of short texts now appear on the Internet. Because short texts are sparse, many classification algorithms that perform well on long texts struggle to achieve satisfactory results on short texts. It is therefore important to find a method that computes effective vector representations of words and overcomes the feature-sparseness problem. Accordingly, we approach the problem from two directions: improving the quality of word vector representations and enhancing classification performance. This paper proposes a systematic framework for improving short text classification. In our framework, we first build a topic model with Latent Dirichlet Allocation (LDA) on a universal dataset drawn from Wikipedia and use this model to perform topic inference on short texts. We then propose an improved scheme of topical word embedding (TWE) to learn vector representations of both words and topics, which uses the word in the current word-topic pair to predict the contextual words, and the topic in the same pair to predict its contextual topics. In addition, the supervised Multi-Cluster Feature Selection (MCFS) algorithm is employed for topic selection, and we propose a topic merging strategy based on MCFS. After topic selection and merging, short text matrices are generated from the vector representations of words and topics and fed into a convolutional neural network (CNN). On an open short text classification dataset, we compare the proposed framework with various baselines; the experimental results demonstrate the effectiveness of our method.
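The first stage of the framework, training an LDA topic model on a universal corpus and then inferring topic distributions for unseen short texts, can be sketched as follows. This is a minimal illustration using scikit-learn's `LatentDirichletAllocation` as a stand-in for the paper's Wikipedia-trained model; the toy corpus, topic count, and variable names are assumptions for the example, not the paper's actual setup.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy "universal" corpus standing in for the Wikipedia dataset.
universal_corpus = [
    "machine learning models classify text documents",
    "neural networks learn vector representations of words",
    "football players score goals in the match",
    "the team won the championship game",
]

# Bag-of-words counts for LDA.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(universal_corpus)

# Fit the topic model on the universal corpus.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Topic inference on unseen short texts: each row of `theta`
# is the inferred topic distribution for one short text.
short_texts = ["word vectors for text classification"]
theta = lda.transform(vectorizer.transform(short_texts))
print(theta.shape)  # one topic distribution per short text
```

Because the topic model is trained on a large external corpus rather than on the sparse short texts themselves, the inferred distributions provide the extra context that the later embedding and classification stages rely on.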