Abstract

Clustering algorithms for big data have important applications in finance. DataMPI is a communication library based on key-value pairs that extends MPI for Hadoop and Spark. We experimentally study the performance of the K-means, fuzzy K-means, and Canopy clustering algorithms on a DataMPI cluster. First, we measure how the number of nodes affects clustering time and scaleup; second, we measure how the memory size of each node affects clustering time and memoryup; throughout, we compare the performance of the three algorithms on different text data sets. The experimental results show that: (1) when the data set size, the memory size, and the number of nodes are held the same, Canopy is the fastest, followed by K-means, and fuzzy K-means is the slowest; (2) when the memory size of each node is fixed, all three algorithms show good scaleup on all of the text data sets, which indicates that adding nodes can significantly improve their efficiency; (3) when the number of nodes is fixed, increasing the memory size from 1 GB to 4 GB significantly decreases the clustering time, which indicates that all three algorithms have good memoryup.
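
For readers unfamiliar with Canopy, the following is a minimal single-machine sketch of the textbook algorithm, not the DataMPI implementation evaluated here. It assumes Euclidean distance and two hypothetical thresholds, `t1` (loose) and `t2` (tight), with `t1 > t2`. It makes one pass per canopy center and has no iterative refinement step, which is one plausible reason Canopy can be cheaper than K-means or fuzzy K-means.

```python
import math


def canopy(points, t1, t2):
    """Group points into canopies; returns a list of (center, members).

    t1 is the loose threshold (canopy membership) and t2 the tight
    threshold (removal from the candidate pool). Both names and the
    Euclidean metric are assumptions made for this illustration.
    """
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")

    remaining = list(points)
    canopies = []
    while remaining:
        # Take the next unassigned point as a new canopy center.
        center = remaining.pop(0)
        members = [center]
        still_remaining = []
        for p in remaining:
            d = math.dist(center, p)
            if d < t1:
                # Loosely close: the point belongs to this canopy.
                members.append(p)
            if d >= t2:
                # Not tightly close: it may still seed or join other canopies.
                still_remaining.append(p)
        remaining = still_remaining
        canopies.append((center, members))
    return canopies
```

Because a point within `t1` but not within `t2` of a center stays in the candidate pool, canopies may overlap; in practice the canopy centers are often used to seed K-means rather than as a final clustering.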
