Abstract

Text clustering is a widely studied problem in text mining. Clustering algorithms based on the Dirichlet Multinomial Mixture (DMM) model cope well with high-dimensional sparse text data, obtaining reasonable results in both clustering accuracy and computational efficiency. However, the time complexity of DMM model training is proportional to the average document length and the number of clusters, making it inefficient to scale up to long texts and large corpora, which are common in real-world applications such as document organization, retrieval, and recommendation. In this paper, we leverage a symmetric prior setting for the Dirichlet distribution and build indices to decrease the time complexity of the sampling-based training for DMM from O(K·L) to O(K·U), where K is the number of clusters, L the average document length, and U the average number of unique words per document. We introduce a Metropolis-Hastings sampling algorithm, which further reduces the sampling time complexity from O(K·U) to O(U) in the near-convergence training stages. Moreover, we parallelize DMM model training with an uncollapsed Gibbs sampler to obtain a further acceleration. We combine all these optimizations into a highly efficient implementation, called X-DMM, which enables the DMM model to scale to long and large-scale text clustering. We evaluate the performance of X-DMM on several real-world datasets, and the experimental results show that X-DMM achieves substantial speedups over existing state-of-the-art algorithms without degrading clustering accuracy.
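To make the O(K·L) to O(K·U) reduction concrete, below is a minimal sketch (not the authors' released code) of a collapsed Gibbs sampling step for DMM with symmetric Dirichlet priors, as in Yin and Wang's GSDMM formulation. By grouping each document into (word, count) pairs and using the log-gamma identity for the rising factorial, each cluster is scored from the document's U unique words rather than all L tokens. The function names, data layout, and parameter names here are our assumptions for illustration.

```python
import numpy as np
from math import lgamma

def dmm_log_probs(doc_counts, m, n_kw, n_k, alpha, beta, V):
    """Score p(z_d = k | rest) for one held-out document under collapsed
    Gibbs sampling for DMM (a sketch; names and layout are assumptions).

    doc_counts : dict word -> count for document d (its U unique words)
    m          : per-cluster document counts (document d already removed)
    n_kw       : per-cluster dicts of word counts (document d removed)
    n_k        : per-cluster total token counts (document d removed)
    V          : vocabulary size; alpha, beta are the symmetric priors
    """
    N_d = sum(doc_counts.values())            # document length L_d
    log_p = np.empty(len(m))
    for k in range(len(m)):
        lp = np.log(m[k] + alpha)             # cluster-size prior term
        # Numerator: one lgamma pair per UNIQUE word -- O(U), not O(L),
        # since prod_{j=1..c} (n + beta + j - 1) = Gamma(n+beta+c)/Gamma(n+beta).
        for w, c in doc_counts.items():
            n = n_kw[k].get(w, 0)
            lp += lgamma(n + beta + c) - lgamma(n + beta)
        # Denominator collapses to a single lgamma pair -- O(1) per cluster.
        lp += lgamma(n_k[k] + V * beta) - lgamma(n_k[k] + V * beta + N_d)
        log_p[k] = lp
    return log_p

def sample_cluster(log_p, rng=None):
    """Draw z_d from the normalized cluster distribution."""
    rng = rng or np.random.default_rng()
    p = np.exp(log_p - log_p.max())           # stabilize before exponentiating
    return rng.choice(len(p), p=p / p.sum())
```

Scoring all K clusters this way costs O(K·U) per document; the Metropolis-Hastings refinement the abstract describes would replace this exhaustive scan with a cheap proposal-and-accept step near convergence, bringing the per-document cost down toward O(U).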
