Improved K-Means Clustering Algorithm for Big Data Mining under Hadoop Parallel Framework

Weijia Lu

doi:10.1007/s10723-019-09503-0

Abstract

In order to improve the accuracy and efficiency of the clustering mining algorithm, this paper focuses on the clustering mining algorithm for large data. Firstly, the traditional clustering mining algorithm is improved to improve the accuracy, and then the improved clustering algorithm is parallelized to improve the efficiency. In order to improve the accuracy of clustering, an incremental K-means clustering algorithm based on density is proposed on the basis of K-means algorithm. Firstly, the density of data points is calculated, and each basic cluster is composed of the center points whose density is not less than the given threshold and the points within the density range. Then, the basic cluster is merged according to the distance between the two cluster centers. Finally, the points that are not divided into any cluster are divided into the clusters nearest to them. In order to improve the efficiency of the algorithm and reduce the time complexity of the algorithm, the distributed database was used to simulate the shared memory space and parallelize the algorithm on the Hadoop platform of cloud computing. The simulation results show that the clustering accuracy of the proposed algorithm is higher than that of the other two algorithms by more than 10%.

Full Text