Improving document clustering in a learned concept space

Jean-François Pessiot, Young-Min Kim, Massih R Amini, Patrick Gallinari

doi:10.1016/j.ipm.2009.09.007

Abstract

Most document clustering algorithms operate in a high-dimensional bag-of-words space. The noise inherent in such a representation degrades the performance of most of these approaches. In this paper we investigate an unsupervised dimensionality reduction technique for document clustering. The technique rests on the assumption that terms co-occurring in the same contexts with the same frequencies are semantically related. On the basis of this assumption, we first find term clusters using a classification version of the EM algorithm. Documents are then represented in the space of these term clusters, and a multinomial mixture model (MM) is used to build document clusters. We show empirically on four document collections, Reuters-21578, Reuters RCV2-French, 20Newsgroups and WebKB, that this new text representation noticeably increases the performance of the MM model. By relating the proposed approach to the Probabilistic Latent Semantic Analysis (PLSA) model, we further propose an extension of PLSA in which an extra latent variable allows the model to co-cluster documents and terms simultaneously. We show on the same four datasets that this extended PLSA model produces statistically significant improvements, with respect to two clustering measures, over all variants of the original PLSA and MM models.
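The two-stage pipeline described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes each term is represented by its vector of counts across documents, uses a hard-assignment (classification) EM over a Laplace-smoothed multinomial mixture, and, for brevity, reuses the same hard-assignment routine for the document clustering step. The function names `cem_multinomial` and `to_concept_space`, the smoothing constants, and the toy matrix are all hypothetical choices made for illustration.

```python
import numpy as np


def cem_multinomial(X, k, n_iter=50, seed=0):
    """Hard-assignment (classification) EM for a multinomial mixture.

    X is an (n_items, n_features) count matrix.
    Returns one cluster label in [0, k) per row of X.
    """
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    labels = rng.integers(k, size=n)  # random initial partition
    for _ in range(n_iter):
        # M-step: mixing weights and Laplace-smoothed multinomial parameters
        onehot = np.eye(k)[labels]                   # (n, k)
        pi = (onehot.sum(axis=0) + 1.0) / (n + k)    # (k,)
        counts = onehot.T @ X + 1.0                  # (k, n_features)
        theta = counts / counts.sum(axis=1, keepdims=True)
        # E-step and C-step: assign each row to its most probable component
        log_post = np.log(pi) + X @ np.log(theta).T  # (n, k)
        new_labels = log_post.argmax(axis=1)
        if np.array_equal(new_labels, labels):
            break  # the partition is stable
        labels = new_labels
    return labels


def to_concept_space(doc_term, term_labels, k):
    """Sum term counts within each term cluster.

    Maps an (n_docs, n_terms) matrix to an (n_docs, k) matrix.
    """
    membership = np.eye(k)[term_labels]  # (n_terms, k), one 1 per row
    return doc_term @ membership


# Toy document-term counts (assumed data): 4 documents, 4 terms.
doc_term = np.array(
    [[3, 2, 0, 0],
     [2, 3, 0, 0],
     [0, 0, 4, 1],
     [0, 0, 1, 4]],
    dtype=float,
)

# Stage 1: cluster terms, each described by its counts across documents.
term_labels = cem_multinomial(doc_term.T, k=2)
# Stage 2: re-express documents over term clusters, then cluster documents.
concepts = to_concept_space(doc_term, term_labels, k=2)
doc_labels = cem_multinomial(concepts, k=2)
```

Because every term belongs to exactly one cluster, the projection preserves each document's total term count while shrinking the number of dimensions from the vocabulary size to the number of term clusters.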

Journal: Information Processing and Management
Publication Date: Oct 21, 2009
Citations: 20

Similar Papers

Adjusting Mixture Weights of Gaussian Mixture Model via Regularized Probabilistic Latent Semantic Analysis
Luo Si ... Rong Jin
01 Jan 2004

Multi-Scale Multi-Level Generative Model in Scene Classification
Wenjie Xie ... Yingjun Tang
IEICE Transactions on Information and Systems | VOL. E94-D
01 Jan 2010

An Object-Oriented Semantic Clustering Algorithm for High-Resolution Remote Sensing Images Using the Aspect Model
Wenbin Yi ... Hong Tang
IEEE Geoscience and Remote Sensing Letters | VOL. 8
01 May 2011

Online PLSA: batch updating techniques including out-of-vocabulary words.
Nikoletta K Bassiou ... Constantine L Kotropoulos
IEEE Transactions on Neural Networks and Learning Systems | VOL. 25
01 Nov 2014
