A Performance Comparison of Clustering Algorithms for Big Data on DataMPI

Mo Hai

doi:10.1007/978-981-15-2810-1_33

Abstract

Clustering algorithms for big data have important applications in finance. DataMPI is a communication library based on key-value pairs that extends MPI for Hadoop and Spark. We experimentally study the performance of the K-means, fuzzy K-means, and Canopy clustering algorithms on a DataMPI cluster. First, we observe the influence of the number of nodes on clustering time and scaleup. Second, we observe the influence of the memory size of each node on clustering time and memoryup. Throughout, we compare the performance of the three algorithms on different text data sets. The experimental results show that: (1) when the data set size, the memory size, and the number of nodes are held constant, Canopy is the fastest, followed by K-means, and fuzzy K-means is the slowest; (2) when the memory size of each node is fixed, all three algorithms show good scaleup on every text data set, which indicates that increasing the number of nodes can significantly improve their efficiency; (3) when the number of nodes is fixed and the memory size of each node is increased from 1 GB to 4 GB, the clustering time decreases significantly, which shows that all three algorithms have good memoryup.
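The abstract does not define scaleup, so the following minimal sketch assumes the definition common in the parallel data mining literature: the time to cluster a base data set on one node, divided by the time to cluster a data set m times larger on m nodes. Values close to 1 then indicate good scaleup. The function name `scaleup` and the timings are hypothetical and purely illustrative; they are not results from the paper.

```python
def scaleup(base_time_s: float, scaled_time_s: float) -> float:
    """Assumed definition: T(1 node, data D) / T(m nodes, data m * D)."""
    if base_time_s <= 0 or scaled_time_s <= 0:
        raise ValueError("timings must be positive")
    return base_time_s / scaled_time_s


# Hypothetical timings in seconds, keyed by node count m.
# Each run on m nodes is assumed to process m times the base data set.
timings = {1: 100.0, 2: 105.0, 4: 112.0}

base = timings[1]
scaleups = {m: scaleup(base, t) for m, t in timings.items()}

for m, value in sorted(scaleups.items()):
    print(f"nodes={m} scaleup={value:.3f}")
```

Under this assumed definition, a curve that stays near 1 as m grows means the added nodes absorb the added data with little overhead.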

Full Text