An Improved Parallelization of K-means Algorithm based on HADOOP

Yizhuo Guo

doi:10.1088/1742-6596/1187/4/042029

Abstract

The single-machine serial programming model is poorly suited to clustering massive data sets. To address this, we combine big data technology with text clustering techniques: text data are stored and processed in a distributed fashion, and text vectorization and clustering are parallelized on the MapReduce programming model. The traditional k-means algorithm is a typical algorithm for solving clustering problems and scales well to large data sets, but its initial centers are chosen randomly, so its results are unstable from run to run. To solve these problems, first, the initial cluster centers are selected and optimized based on the ideas of density segmentation and sampling. Second, the data set is sampled in parallel, and the best candidate cluster centers are found with reference to the maximum-minimum (max-min) method, with data objects searched and merged in parallel. Finally, the selected initial cluster centers replace the centers that the k-means algorithm would otherwise choose at random, and the clustering algorithm is parallelized. Experiments show that the improved k-means algorithm effectively reduces the number of iterations.
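To make the pipeline in the abstract concrete, the following is a minimal single-process sketch in Python, not the paper's Hadoop implementation. It assumes the max-min method means the common farthest-first rule (each new center is the point whose distance to its nearest already-chosen center is largest), and it omits the density segmentation and parallel sampling steps, which the abstract does not specify in detail. All function names (`max_min_centers`, `map_assign`, `reduce_mean`, `kmeans`) are hypothetical, and the map and reduce phases are simulated with ordinary loops.

```python
import math
from collections import defaultdict


def max_min_centers(points, k):
    """Pick k initial centers with a farthest-first (max-min) rule.

    Assumption: the first center is simply the first point; the paper's
    density-based selection is not reproduced here.
    """
    centers = [points[0]]
    while len(centers) < k:
        # Choose the point whose nearest chosen center is farthest away.
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(nxt)
    return centers


def map_assign(point, centers):
    """Map phase: emit (index of nearest center, (point, 1))."""
    idx = min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))
    return idx, (point, 1)


def reduce_mean(values):
    """Reduce phase: average all points that share one center index."""
    dim = len(values[0][0])
    total = [0.0] * dim
    count = 0
    for vec, n in values:
        for d in range(dim):
            total[d] += vec[d]
        count += n
    return tuple(t / count for t in total)


def kmeans(points, centers, max_iter=100):
    """Repeat map/shuffle/reduce rounds until the centers stop changing.

    Returns the final centers and the number of rounds executed.
    """
    for iteration in range(1, max_iter + 1):
        groups = defaultdict(list)  # simulated shuffle: group by center index
        for p in points:
            key, value = map_assign(p, centers)
            groups[key].append(value)
        # A center that received no points keeps its previous position.
        new_centers = [
            reduce_mean(groups[i]) if i in groups else centers[i]
            for i in range(len(centers))
        ]
        if new_centers == centers:
            return centers, iteration
        centers = new_centers
    return centers, max_iter


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    init = max_min_centers(data, 2)
    final, rounds = kmeans(data, init)
    print(init, final, rounds)
```

In an actual MapReduce job, each mapper would hold the current centers and process one split of the data, and a combiner could pre-sum the `(point, 1)` pairs locally before the reducers compute the means. Starting from well-separated centers rather than random ones is what the abstract credits for the lower iteration count.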

Full Text