Abstract

In this paper, we propose a new parallel algorithm for text document clustering based on the concept of neighbors (Guha et al. in Inf Syst 25(5):345-366, 2000). If two documents are similar enough, they are considered neighbors of each other. The new algorithm is named parallel k-means based on neighbors (PKBN), and it is a parallel version of sequential k-means based on neighbors (SKBN), which we proposed in Luo et al. (Data Knowl Eng 68(11):1271-1288, 2009). PKBN fully exploits the data parallelism of SKBN and adopts a new parallel pair-generating method to build the neighbor matrix. This pair-generating method incurs less communication overhead between processors than existing methods. PKBN is designed for message-passing multiprocessor systems, and we implemented it on a cluster of Linux workstations to analyze its performance. Our experimental results on real-life data sets demonstrate that PKBN is very efficient and scales well with respect to both the number of processors and the size of the data set.
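
To make the neighbor concept concrete, the following is a minimal sequential sketch of building a neighbor matrix. It is an illustration only, not the PKBN implementation: the use of cosine similarity, the threshold name `theta`, the function names, and the choice to treat each document as its own neighbor are all assumptions that the abstract does not specify.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_u * norm_v)


def neighbor_matrix(docs, theta):
    """Return an n x n 0/1 matrix; entry [i][j] is 1 when documents i and j
    are neighbors, i.e. their similarity is at least theta (hypothetical name)."""
    n = len(docs)
    m = [[0] * n for _ in range(n)]
    for i in range(n):
        m[i][i] = 1  # assumption: a document counts as its own neighbor
        # Visit each unordered pair (i, j) once and fill both symmetric entries.
        for j in range(i + 1, n):
            if cosine(docs[i], docs[j]) >= theta:
                m[i][j] = m[j][i] = 1
    return m


# Three toy documents over a three-term vocabulary.
docs = [[1.0, 0.0, 1.0], [1.0, 0.0, 0.9], [0.0, 1.0, 0.0]]
matrix = neighbor_matrix(docs, theta=0.8)
```

The nested loop over document pairs is the step a parallel pair-generating method would divide among processors; how PKBN assigns pairs and exchanges results is not described in the abstract and is not shown here.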
