IMPROVED PARALLEL BIG DATA CLUSTERING BASED ON K-MEDOIDS AND K-MEANS ALGORITHMS

Rasim Alguliyev, Ramiz Aliguliyev, Lyudmila Sukhostat

doi:10.25045/jpit.v15.i1.03

Abstract

In recent years, the amount of data created worldwide has grown exponentially. The increase in computational complexity when working with Big data creates a need for new approaches to clustering it. The problem of clustering massive amounts of data can be addressed with parallel processing: dividing the data into batches makes it possible to perform clustering in a reasonable time. In this case, the reliability of the result obtained for each batch affects the clustering performance on the entire dataset. The main idea of the proposed approach is to apply the k-medoids and k-means algorithms to parallel Big data clustering. The advantage of this hybrid approach is that it represents each cluster by its central object (the medoid), which makes it less sensitive to outliers than k-means clustering. Experiments are conducted on two real datasets, YearPredictionMSD and Phone Accelerometer. The proposed approach is compared with the k-means and MiniBatch k-means algorithms. The experimental results show that the proposed parallel implementation of k-medoids combined with k-means achieves greater accuracy and runs faster than the k-means algorithm.
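The batch-wise scheme described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the two algorithms are combined, so the following is only one plausible reading, stated as an assumption: k-medoids is run independently on each batch, the resulting medoids are pooled, and k-means is run on the pooled medoids to obtain the final centers. The function names (`assign`, `k_medoids`, `k_means`, `parallel_cluster`), the strided batching, the alternating medoid update, and all parameter defaults are hypothetical and are not taken from the paper.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from math import dist


def assign(points, centers):
    """Group points by the index of their nearest center (Euclidean distance)."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
        clusters[nearest].append(p)
    return clusters


def k_medoids(points, k, iters=20, seed=0):
    """Simple alternating k-medoids; medoids are always actual data points."""
    rng = random.Random(seed)
    medoids = rng.sample(points, k)
    for _ in range(iters):
        clusters = assign(points, medoids)
        # New medoid = member with the smallest total distance to its cluster.
        updated = [
            min(members, key=lambda c: sum(dist(c, q) for q in members))
            if members else medoids[j]
            for j, members in enumerate(clusters)
        ]
        if updated == medoids:
            break
        medoids = updated
    return medoids


def k_means(points, k, iters=50, seed=0):
    """Plain Lloyd-style k-means; centers are coordinate-wise means."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = assign(points, centers)
        updated = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else centers[j]
            for j, members in enumerate(clusters)
        ]
        if updated == centers:
            break
        centers = updated
    return centers


def parallel_cluster(data, k, n_batches=4, workers=4):
    """Assumed combination: per-batch k-medoids, then k-means on pooled medoids."""
    # Strided split so every batch sees a mix of the data (an assumption).
    batches = [data[i::n_batches] for i in range(n_batches)]
    # Threads keep the sketch portable; a process pool would be needed
    # for real CPU parallelism in CPython.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_batch = list(pool.map(lambda batch: k_medoids(batch, k), batches))
    pooled = [m for medoids in per_batch for m in medoids]
    return k_means(pooled, k)
```

Calling `parallel_cluster(data, k=2)` on a list of coordinate tuples returns `k` center tuples. Because each batch contributes only `k` medoids, the final k-means step operates on `n_batches * k` points rather than on the full dataset, which is where a scheme of this kind would be expected to save time.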
