Topic Discovery for Streaming Short Texts with CTM

Yunfeng Xu, Xiaomin Sun, Junhui Deng, Hanyong Hao, Longxia Zhu, Xiaoli Bai, Hua Xu

doi:10.1109/ijcnn.2018.8489770

Abstract

Short texts are prevalent on today’s Web, especially with the emergence of social media. Discovering the topics of streaming short texts has therefore become an important task for many content analysis applications. Conventional topic models such as Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA) suffer from the sparsity problem when used to infer latent topics from short texts, because they derive topics from document-level word co-occurrence by modeling each document as a mixture of topics. In contrast, the Biterm Topic Model (BTM) discovers topics in short texts by directly modeling the generation of word co-occurrence patterns across the whole corpus. However, short texts still lack semantic information. In this paper, to alleviate the sparsity problem, preserve the semantic information of documents, and obtain the latent topic information of streaming short texts immediately, we propose a joint topic model for Chinese streaming short texts (CTM) based on the online algorithms of LDA and BTM. Experiments on short texts from Sina Weibo show that our joint topic model discovers more precise topics and supports more applications. In addition, because preprocessing Chinese text differs from preprocessing English text and key-phrase extraction is error-prone, we use a combined-word method to extend the length of short texts and reduce errors in extracting key phrases.
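
To make the idea of corpus-level word co-occurrence concrete, the following minimal Python sketch extracts biterms (unordered word pairs) from tokenized short texts and counts them over a corpus. It is an illustration only, not the implementation used in this paper: the function names `extract_biterms` and `corpus_biterm_counts`, the toy documents, and the choice to treat each whole short text as a single context window are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def extract_biterms(tokens):
    """Return every unordered word pair (biterm) in one tokenized short text.

    Assumption: the whole short text is treated as a single context window.
    """
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def corpus_biterm_counts(docs):
    """Count biterms over the whole corpus rather than per document."""
    counts = Counter()
    for tokens in docs:
        counts.update(extract_biterms(tokens))
    return counts


# Toy, hypothetical corpus of already-tokenized short texts.
docs = [["apple", "phone", "launch"], ["phone", "launch", "event"]]
counts = corpus_biterm_counts(docs)
```

Because the counts are pooled over all documents, a pair such as ("launch", "phone") accumulates evidence even though each individual text contains only a few words, which is the intuition behind modeling co-occurrence at the corpus level.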

Full Text