Ensemble subspace clustering of text data using two-level features

He Zhao, Yeshou Cai, Salman Salloum, Joshua Zhexue Huang

doi:10.1007/s13042-016-0556-5

Abstract

This paper proposes a new integrated method for ensemble subspace clustering of high-dimensional sparse text data. The method uses a two-level feature representation of text data (words and topics) to generate clusters from subspaces, and uses ensemble clustering to make the clusters more robust. It relies on topic modeling both to obtain the two-level feature representation and to generate the different ensemble components. Clustering on both topics and words yields more interpretable clusters, because the weights of words and topics can be measured in each cluster. To evaluate the proposed method, we conducted several experiments on seven real-life data sets: some are easy to cluster, some are hard, and some contain unbalanced data. Experimental results on this diverse collection of data sets show that our method outperforms other ensemble clustering methods.
