A parallel text document clustering algorithm based on neighbors

Yanjun Li,Congnan Luo,Soon M Chung

doi:10.1007/s10586-015-0450-z

Abstract

In this paper, we propose a new parallel algorithm for text document clustering based on the concept of neighbor (Guha et al. in Inf Syst 25(5):345---366, 2000). If two documents are similar enough, they are considered as neighbors of each other. The new algorithm is named parallel k-means based on neighbors (PKBN), and it is a parallel version of sequential k-means based on neighbors (SKBN) that we proposed in Luo et al. (Data Knowl Eng 68(11):1271---1288, 2009). PKBN fully exploits the data-parallelism of SKBN and adopts a new parallel pair-generating method to build the neighbor matrix. Our new parallel pair-generating method causes less communication overhead between processors than existing methods. PKBN is designed for message-passing multiprocessor systems and is implemented on a cluster of Linux workstations to analyze its performance. Our experimental results on real-life data sets demonstrate that PKBN is very efficient and has good scalability with respect to the number of processors and the size of data set.

Full Text