Corpus-based topic diffusion for short text clustering

Chu Tao Zheng, Cheng Liu, Hau San Wong

doi:10.1016/j.neucom.2017.11.019

https://doi.org/10.1016/j.neucom.2017.11.019

Journal: Neurocomputing
Publication Date: Nov 16, 2017
Citations: 38

Affiliation: City University of Hong Kong

Abstract

In this paper, we propose a novel corpus-based enrichment approach for short text clustering. Because sparseness leads to insufficient word co-occurrence and a lack of context information, previous studies have used external sources such as Wikipedia or WordNet to enrich the representation of short text documents, which requires extra resources and may introduce inconsistencies. Corpus-based approaches, on the other hand, use no external information in mining short text data. By introducing a set of conjugate definitions to characterize the structures of topics and words, and by proposing a virtual generative procedure for short texts, we perform expansion on short text data. Specifically, new words that may not appear in a short text document are added with a virtual term frequency, which is obtained from the posterior probabilities of the new words given all the words in that document. The complete procedure can be regarded as mapping data points (documents) from the original feature space to a hidden semantic space (topic space). After semantic smoothing is performed, the data points are mapped back to the original space. We conduct experiments on two short text datasets, and the results show that the proposed method can effectively address the sparseness problem. On these datasets, our method, using only a basic clustering algorithm, attains performance comparable to that of methods based on enrichment with external information sources.
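
The round trip described in the abstract (word space to topic space and back, with unseen words receiving a virtual term frequency) can be illustrated with a minimal sketch. This is not the paper's method: the abstract does not give the conjugate definitions or the generative procedure, so the sketch assumes a hand-written topic-word matrix, a uniform topic prior, and a hypothetical weight `alpha` that scales the virtual counts. The function name `enrich` and all variable names are illustrative only.

```python
import numpy as np


def enrich(doc_counts, word_given_topic, alpha=0.5):
    """Add virtual term frequencies to sparse document vectors.

    doc_counts:       (n_docs, n_words) raw term counts, no empty documents.
    word_given_topic: (n_topics, n_words), each row is P(word | topic);
                      every word must have nonzero mass in some topic.
    alpha:            hypothetical weight; virtual mass per document is
                      alpha times the document length.
    """
    # Bayes' rule under an assumed uniform topic prior: P(t | w) is P(w | t)
    # renormalized over topics, so each column sums to 1.
    topic_given_word = word_given_topic / word_given_topic.sum(axis=0, keepdims=True)

    # Map documents to topic space: average P(t | w) over the document's tokens.
    lengths = doc_counts.sum(axis=1, keepdims=True)
    doc_topics = (doc_counts @ topic_given_word.T) / lengths

    # Map back to word space: P(w' | d) = sum_t P(w' | t) P(t | d).
    word_posterior = doc_topics @ word_given_topic

    # Virtual counts are proportional to the posterior of each word.
    return doc_counts + alpha * lengths * word_posterior


# Toy vocabulary (assumed): goal, match, team, stock, market, bank
word_given_topic = np.array([
    [1 / 3, 1 / 3, 1 / 3, 0.0, 0.0, 0.0],  # assumed "sports" topic
    [0.0, 0.0, 0.0, 1 / 3, 1 / 3, 1 / 3],  # assumed "finance" topic
])
doc_counts = np.array([
    [1.0, 1.0, 0.0, 0.0, 0.0, 0.0],  # "goal match"
    [0.0, 0.0, 0.0, 1.0, 0.0, 1.0],  # "stock bank"
])

enriched = enrich(doc_counts, word_given_topic)
print(enriched)
```

In this toy case the first document never contains "team", yet it receives a nonzero virtual frequency for it because both of its words point to the same assumed topic. The enriched vectors could then be passed to a basic clustering algorithm in place of the raw counts.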
