Abstract

Text clustering is one of the core techniques of natural language processing, and ensemble clustering improves its robustness. Existing research shows that the quality and diversity of the base clusterings strongly influence the consensus clustering, and that this effect is particularly pronounced for text clustering. However, few studies aim to reduce the number of low-quality clusterings in an ensemble. This paper proposes a novel clustering filtering model based on an entropy criterion, which is used to evaluate the uncertainty of each cluster w.r.t. the ensemble. Two indexes are derived from this cluster uncertainty: the Clustering Trend Index (CTI), which indicates the contribution of each cluster w.r.t. its base clustering, and the Cluster Consistency Index (CCI), which indicates the degree of cluster dispersion within the base clustering. The filtering model is built on a new weight that combines the two indexes. Dropping the low-quality clusterings increases the proportion of high-quality clusterings in the ensemble. Extensive experiments on a variety of real text data sets, using optimal thresholds, show that the proposed method substantially improves accuracy and robustness and outperforms existing ensemble clustering algorithms.
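
The abstract does not give the formula behind the entropy criterion, so the following is only a minimal sketch of one common way to measure the uncertainty of a cluster w.r.t. an ensemble: sum, over all base clusterings, the entropy of how the cluster's members are split among that clustering's labels. The function names `cluster_entropy` and `cluster_uncertainty` and the toy ensemble are hypothetical, and the sketch does not implement CTI, CCI, or the filtering weight.

```python
import math
from collections import Counter


def cluster_entropy(cluster, partition):
    """Entropy (in bits) of how the items of `cluster` are split across
    the labels of one base clustering.

    `cluster` is a set of item indices; `partition` is a list of labels
    indexed by item. Assumed formulation, not taken from the paper.
    """
    counts = Counter(partition[i] for i in cluster)
    n = len(cluster)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def cluster_uncertainty(cluster, ensemble):
    """Sum of the cluster's entropies over every base clustering."""
    return sum(cluster_entropy(cluster, p) for p in ensemble)


# Toy ensemble: three base clusterings of six documents.
ensemble = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 2, 2],
]

stable = {0, 1, 2}   # cluster 0 of the first two clusterings
unstable = {2, 3}    # cluster 1 of the third clustering

print(cluster_uncertainty(stable, ensemble))
print(cluster_uncertainty(unstable, ensemble))
```

Under this assumed measure, a cluster that the other base clusterings mostly keep intact scores lower (less uncertain) than one they split apart, which is the kind of signal a filtering step could threshold on.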
