Approximate Partitional Clustering Through Systematic Sampling in Big Data Mining

Kamlesh Kumar Pandey, Diwakar Shukla

doi:10.1007/978-981-16-1220-6_19

Abstract

Big data mining is an intelligent process of extracting hidden knowledge from high-volume, high-variety, and high-velocity data environments for decision-making systems. In big data settings, classical data mining algorithms face challenges related to memory utilization, speed-up, scale-up, computing cost, efficiency, and effectiveness. Data volume is a prime attribute of big data mining and is responsible for variety- and velocity-related challenges. An intelligent big data mining process incorporates classical data mining and statistics under single- and multiple-machine execution environments. Sampling is a data reduction technique that addresses volume-related challenges and improves the speed, scalability, flexibility, accuracy, quality, efficiency, and memory utilization of any data mining algorithm, regardless of the algorithm's characteristics. This paper proposes a systematic sampling-based big data mining model built on K-means clustering, known as SYK-means (systematic sampling-based K-means). The experimental results of the SYK-means algorithm are compared with those of the RSK-means (random sampling-based K-means) and classical K-means algorithms with respect to sample-size selection and entire-data selection. In the experimental evaluation, the SYK-means algorithm achieved better effectiveness and efficiency as measured by R-squared, root-mean-square standard deviation, Davies-Bouldin, Calinski-Harabasz, Silhouette coefficient, CPU time, and convergence validation indices.

Keywords

Big data characteristics; Big data mining; Big data clustering; Data reduction; Random sampling; Systematic sampling; SYK-means
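
The sample-then-cluster idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`systematic_sample`, `kmeans`, `sampled_kmeans`), the sampling interval floor(N/n), the random starting offset, and the final step that assigns every record to its nearest sample-derived centroid are all assumptions made for illustration, and a plain Lloyd-style K-means stands in for whatever variant the paper uses.

```python
import random


def systematic_sample(n_rows, sample_size, seed=0):
    """Pick sample_size row indices: a random start, then every step-th row."""
    if not 0 < sample_size <= n_rows:
        raise ValueError("sample_size must be in 1..n_rows")
    step = n_rows // sample_size                  # assumed interval: floor(N / n)
    start = random.Random(seed).randrange(step)   # random offset in [0, step)
    return [start + i * step for i in range(sample_size)]


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(p, centroids):
    """Index of the centroid closest to point p."""
    return min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))


def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd-style K-means; returns the final list of centroids."""
    centroids = random.Random(seed).sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Recompute each centroid as the mean of its members;
        # an empty cluster keeps its previous centroid.
        updated = [
            tuple(sum(col) / len(members) for col in zip(*members))
            if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if updated == centroids:                  # no change: converged
            break
        centroids = updated
    return centroids


def sampled_kmeans(points, k, sample_size, seed=0):
    """Cluster a systematic sample, then label every point (assumed step)."""
    idx = systematic_sample(len(points), sample_size, seed)
    centroids = kmeans([points[i] for i in idx], k, seed=seed)
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


# Synthetic demo data (hypothetical): two well-separated 2-D blobs.
rng = random.Random(42)
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(500)]
data += [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(500)]
centroids, labels = sampled_kmeans(data, k=2, sample_size=100)
```

In this sketch only 100 of the 1,000 records take part in the iterative centroid updates, which is where a sampling-based approach would be expected to save CPU time. The full dataset is scanned once at the end to produce labels.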

Full Text