Chapter 27 - D-PPSOK clustering algorithm with data sampling for clustering big data analysis

C Suresh Gnana Dhas, N Yuvaraj, N.V Kousik, Tadele Degefa Geleto

doi:10.1016/b978-0-323-90240-3.00027-8

Abstract

Clustering is an essential data mining tool for investigating big data. Applying clustering techniques to big data is difficult because of new challenges that arise at this scale. Big data refers to terabytes and petabytes of data, and clustering algorithms generally incur high computational costs, so the question is how to deploy clustering methods on big data and obtain results in a reasonable time. Clustering is an essential area of analysis in data processing. For several decades, k-means has remained the most popular clustering algorithm because of its simplicity. Recently, as data volumes have continued to grow into what is called big data, many researchers have proposed clustering algorithms for big data that aim for high performance. This chapter proposes the distributed-parallel particle swarm optimization with k-means (D-PPSOK) clustering algorithm with data sampling, applied to large data sets to obtain the clusters in fewer computational steps. According to the experimental results, the proposed D-PPSOK with data sampling achieves high performance on large-scale data sets compared with the existing algorithms.

Full Text