Abstract

Before inferring the true number of clusters in short text clustering, the Dirichlet Multinomial Mixture (DMM) model assumes that there are at most Kmax clusters. In some cases, it is difficult to choose a proper Kmax beforehand. In this paper, we propose a novel model based on the Pitman-Yor Process to capture the power-law phenomenon of the cluster distribution. Specifically, each text chooses one of the active clusters or a new cluster with probabilities derived from the Pitman-Yor Process Mixture model (PYPM). Unlike the DMM model, our model does not require Kmax as input. Discriminative and nondiscriminative words are identified automatically to help enhance text clustering. Parameters are estimated efficiently by collapsed Gibbs sampling. Experiments on real-world datasets validate the effectiveness of the proposed model in comparison with other state-of-the-art models.
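The cluster-choice rule the abstract describes (an existing active cluster or a new one, with no fixed Kmax) can be illustrated with the standard Pitman-Yor Chinese restaurant process predictive probabilities. This is a minimal sketch, not the paper's full PYPM inference: the hyperparameter values `alpha` (concentration) and `d` (discount) are illustrative assumptions, and documents are reduced to bare items with no word-level likelihood.

```python
import random

def pyp_crp(n_items, alpha=1.0, d=0.5, seed=0):
    """Simulate cluster assignments under a Pitman-Yor Chinese
    restaurant process: item i joins existing cluster k with weight
    (n_k - d), or opens a new cluster with weight (alpha + d * K),
    where K is the current number of active clusters.
    Hyperparameters alpha and d are illustrative, not from the paper."""
    rng = random.Random(seed)
    counts = []       # counts[k] = number of items in cluster k
    assignments = []  # cluster index chosen for each item
    for _ in range(n_items):
        weights = [n_k - d for n_k in counts]       # active clusters
        weights.append(alpha + d * len(counts))     # new cluster
        r = rng.random() * sum(weights)
        acc = 0.0
        for k, w in enumerate(weights):
            acc += w
            if r < acc:
                break
        if k == len(counts):   # opened a new cluster
            counts.append(1)
        else:                  # joined an active cluster
            counts[k] += 1
        assignments.append(k)
    return assignments, counts
```

Because the new-cluster weight grows with the number of active clusters K, the discount d > 0 yields the power-law behavior in cluster sizes that motivates the model; in the full PYPM these prior weights would be multiplied by each cluster's likelihood of generating the document's words.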
