A Hot Topic Identification Based on LDA and K-means++ Algorithm for Tibetan Microblog

Li Ailin, Jiang Tao, Dai Yugang, Yu Hongzhi

doi:10.12783/dtetr/iect2016/3721

Abstract

As microblogging grows more popular, services such as Sina Weibo have become web-scale information providers. Tibetan microblogs are among the most popular Tibetan online media, and research on them is increasing. However, because of the particular characteristics of microblog text and of the Tibetan language, traditional hot topic identification methods for microblogs do not meet the need. This paper proposes a hot topic identification method for Tibetan microblogs based on LDA and the K-means++ algorithm. First, we model the microblog corpus with LDA, choose the number of topics by perplexity, and estimate the parameters with Gibbs sampling, which yields the topic-word and document-topic probability distributions. Second, we use K-means++ to cluster Tibetan microblogs that share the same or similar topics. Third, we calculate and identify hot topics and hot words. Experimental results show that the proposed LDA+K-means++ method achieves higher accuracy than an SVM method and an LSA+K-means++ method.
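
The clustering step can be illustrated with a minimal sketch. It assumes the document-topic distributions have already been estimated by LDA and are available as rows of a list (the name `theta` and the toy values are hypothetical, not data from the paper). It shows K-means++ seeding followed by ordinary Lloyd iterations; the paper's actual implementation, distance measure, and hot topic scoring are not given in the abstract, so squared Euclidean distance is an assumption here.

```python
import random
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_init(points, k, rng):
    """K-means++ seeding: each new center is drawn with probability
    proportional to its squared distance from the nearest chosen center."""
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        total = sum(d2)
        if total == 0:  # every point already coincides with a center
            centers.append(list(rng.choice(points)))
            continue
        r = rng.random() * total
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc > r:
                centers.append(list(p))
                break
        else:
            # Guard against floating-point round-off: take the farthest point.
            centers.append(list(points[d2.index(max(d2))]))
    return centers


def kmeans(points, k, iters=50, seed=0):
    """Lloyd's algorithm started from K-means++ seeds."""
    rng = random.Random(seed)
    centers = kmeanspp_init(points, k, rng)
    labels = None
    for _ in range(iters):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical document-topic distributions (6 microblogs, 3 topics).
theta = [
    [0.90, 0.05, 0.05], [0.85, 0.10, 0.05], [0.80, 0.10, 0.10],
    [0.05, 0.90, 0.05], [0.10, 0.85, 0.05], [0.10, 0.80, 0.10],
]
labels, centers = kmeans(theta, k=2)
cluster_sizes = Counter(labels)  # size is only an illustrative proxy for "hotness"
```

Microblogs dominated by the same topic end up in the same cluster. Counting cluster members is one simple way to rank clusters, but it is an assumption for illustration and not necessarily the measure used in the paper.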
