Graph and Centroid-based Word Clustering

Santipong Thaiprayoon, Mario Kubek, Herwig Unger

doi:10.1145/3443279.3443290

Abstract

With the recent exponential growth of text documents, word clustering has become an essential technique for reducing large volumes of text data and for unsupervised feature selection in natural language processing. This paper proposes a novel graph- and centroid-based word clustering method. The method automatically groups similar words into the same cluster while handling noisy text and outliers. It applies concepts from hierarchical agglomerative clustering and the K-means algorithm to find similar words according to a distance-range criterion on the co-occurrence graph. Small clusters and isolated words are then merged into other clusters. Experimental results demonstrate that the proposed method consistently and significantly outperforms state-of-the-art word clustering baselines on the ground-truth dataset. Moreover, the method is unsupervised and generic, so it could be applied to various natural language processing and text mining tasks.
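
The general idea of clustering words around centroids within a distance range on a co-occurrence graph can be illustrated with a minimal sketch. This is not the authors' algorithm, whose details the abstract does not give. The sketch makes several assumptions: words co-occur when they share a sentence, distance is the number of hops in the graph, centroids are picked greedily by degree, and a small cluster joins the large cluster it shares the most edges with. All function names and the parameters `max_hops` and `min_size` are hypothetical.

```python
from collections import defaultdict, deque
from itertools import combinations


def build_cooccurrence_graph(sentences):
    """Link every pair of words that appear in the same sentence (assumption)."""
    graph = defaultdict(set)
    for sentence in sentences:
        words = set(sentence.lower().split())
        for word in words:
            graph[word]  # make sure words without neighbours still become nodes
        for a, b in combinations(sorted(words), 2):
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


def hop_distances(graph, source, max_hops):
    """Breadth-first search returning all words within max_hops of source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand beyond the distance range
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def cluster_words(graph, max_hops=1, min_size=2):
    """Greedy centroid clustering followed by merging of small clusters."""
    unassigned = set(graph)
    clusters = []
    # Hypothetical centroid choice: highest degree first, ties broken by name.
    for word in sorted(graph, key=lambda w: (-len(graph[w]), w)):
        if word not in unassigned:
            continue
        reachable = hop_distances(graph, word, max_hops)
        members = {w for w in reachable if w in unassigned}
        clusters.append((word, members))
        unassigned -= members

    result = {c: set(m) for c, m in clusters if len(m) >= min_size}
    large_centroids = list(result)
    for centroid, members in clusters:
        if len(members) >= min_size:
            continue
        # Merge a small cluster into the large cluster it shares most edges with.
        best, best_links = None, 0
        for large in large_centroids:
            links = sum(len(graph[w] & result[large]) for w in members)
            if links > best_links:
                best, best_links = large, links
        if best is None:
            result[centroid] = set(members)  # no connection: keep it separate
        else:
            result[best] |= members
    return result


sentences = [
    "cat dog pet",
    "dog pet food",
    "car road drive",
    "car drive engine",
    "engine oil",
]
clusters = cluster_words(build_cooccurrence_graph(sentences))
```

On this toy input the words fall into an animal-related group and a vehicle-related group, with the lone word "oil" absorbed into the cluster that contains its only neighbour. A word with no edges at all stays in its own cluster here, which is a simplification compared with the merging of isolated words that the abstract describes.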

Full Text