Abstract

Semi-supervised text clustering, a research branch of text clustering, aims to employ limited prior knowledge to aid the unsupervised text clustering process and to help users obtain improved clustering results. Because labeled data are difficult, expensive, and time-consuming to obtain, it is important to use this supervised information effectively so that clustering performance improves significantly. This paper proposes a semi-supervised LDA text clustering algorithm based on the weights of the word distribution (WWDLDA). By introducing word-distribution coefficients obtained from labeled data, the LDA model can be applied to semi-supervised clustering. During clustering, the coefficients continually adjust the word distribution and thereby steer the clustering results. Our experimental results on real data sets show that the proposed algorithm obtains better clustering results than constrained mixmnl, where mixmnl denotes the multinomial model-based EM algorithm.

Introduction

Text clustering, an important method of knowledge discovery, is an unsupervised procedure for automatic text classification. By analyzing the relationships between documents, text clustering groups articles on the same theme into the same class. Because it requires neither a training process nor prior category labels, text clustering offers a high degree of automation and flexibility, and it is widely used in data mining, information retrieval, and topic detection. Research on text clustering is reported in [1-3].

Traditional document clustering algorithms are unsupervised learning methods that process unlabeled documents. In practical applications, however, limited prior knowledge about the data is often available, including class labels and constraints on the document partition (such as pairwise constraints) [4]. Semi-supervised text clustering is a research branch of text clustering: it uses prior labeled data to guide the unsupervised clustering process, building on traditional text clustering methods, and obtains better clustering results. Semi-supervised text clustering has recently become a topic of significant interest.

The complexity of document corpora has led to considerable interest in applying hierarchical statistical models based on what are called topics. A topic model reduces the data dimension by changing the document representation from words to topics, yielding a new document representation. Among topic models, latent Dirichlet allocation (LDA) [5] is one of the simplest, most popular, and arguably most important probabilistic models in widespread use today. When documents are clustered according to topic, the inferred topic distributions yield the clustering result, so LDA can be applied to text clustering. LDA itself, however, is an unsupervised learning algorithm.

This paper puts forward a new semi-supervised text clustering algorithm that embeds weights of the word distribution into LDA. The coefficients guide the clustering process by updating the word distribution and thereby enhance the clustering performance; a sketch of this idea is given below. The semi-supervised LDA text clustering algorithm based on the weights of the word distribution (WWDLDA) is evaluated on real data sets. The experimental results show that WWDLDA performs better than the constrained mixmnl algorithm [6].
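This excerpt describes the weighting idea only at a high level, and the exact definition of the coefficients is not given here. The following Python sketch illustrates one plausible reading: estimate per-class word frequencies from the labeled documents and use them to re-weight the topic-word distribution during inference. The function names, the multiplicative re-weighting form, and the `strength` parameter are illustrative assumptions, not the authors' published formulas.

```python
import numpy as np

def word_weights_from_labels(labeled_docs, labels, n_topics, vocab_size, smooth=1.0):
    """Estimate per-topic word weights from the labeled subset.

    labeled_docs: list of documents, each a list of integer word ids
    labels:       class label per document (assumed aligned with a topic)
    Returns a (n_topics, vocab_size) matrix of smoothed relative frequencies.
    """
    counts = np.full((n_topics, vocab_size), smooth)
    for doc, k in zip(labeled_docs, labels):
        for w in doc:
            counts[k, w] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)

def reweight_topic_words(phi, weights, strength=0.5):
    """Nudge an inferred topic-word distribution phi (n_topics x vocab_size)
    toward the labeled-data weights, then renormalize each topic row."""
    adjusted = phi * weights ** strength  # hypothetical multiplicative update
    return adjusted / adjusted.sum(axis=1, keepdims=True)
```

Under this reading, the re-weighting would be applied after each update of the topic-word distribution, so the labeled documents continually pull the topics toward the known classes, consistent with the statement above that the coefficients continually adjust the word distribution during clustering.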
Latent Dirichlet Allocation

Latent Dirichlet allocation (LDA), presented by Blei, is a topic model and a generative probabilistic model of a corpus. A document consisting of a large number of words can be concisely modeled as deriving from a smaller number of topics; a topic is a probability distribution over words. The basic idea of LDA is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words.

[Fig. 1. Graphical model representation of LDA.]

According to the graphical model representation shown in Fig. 1, LDA assumes the following generative process for a document: first, choose a variable $\theta$, where $\theta$ is the random parameter of a multinomial over topics and follows a Dirichlet distribution; second, choose a topic $z_n$, then choose a word $w_n$ from a multinomial probability conditioned on the topic $z_n$; finally, repeat the choice of topic and word $N$ times. The probability of a corpus $D$ of $M$ documents is then

$$p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) \mathrm{d}\theta_d .$$
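To make the generative process concrete, here is a minimal, runnable Python sketch of it using NumPy; the toy values of $\alpha$ and $\beta$ are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_document(alpha, beta, doc_len):
    """Sample one document from the LDA generative process.

    alpha: Dirichlet prior over topics, shape (K,)
    beta:  topic-word distributions, shape (K, V), each row sums to 1
    """
    # 1. Draw the document's topic mixture theta ~ Dirichlet(alpha).
    theta = rng.dirichlet(alpha)
    words = []
    for _ in range(doc_len):
        # 2. Draw a topic z_n ~ Multinomial(theta).
        z = rng.choice(len(alpha), p=theta)
        # 3. Draw a word w_n from the chosen topic's word distribution.
        w = rng.choice(beta.shape[1], p=beta[z])
        words.append(w)
    return words

# Toy corpus parameters: K = 2 topics over a V = 5 word vocabulary.
alpha = np.ones(2)
beta = np.array([[0.50, 0.30, 0.10, 0.05, 0.05],
                 [0.05, 0.05, 0.10, 0.30, 0.50]])
print(generate_document(alpha, beta, doc_len=8))
```

Integrating $\theta$ out of this process, as in the equation above, gives the marginal probability of a single document, and the product over all documents gives the corpus probability.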
