Exploratory clustering of big data is paradoxical because little or no prior domain knowledge is available. Moreover, clustering a big dataset within a distributed computing framework is itself a challenging task. To address this, we propose a new distributed clustering approximation framework for big data with quality guarantees. The framework uses multiple disjoint random samples, rather than a single random sample, to compute an ensemble result that estimates the true clustering of the entire big dataset. First, the large dataset is modeled as a collection of random sample data blocks stored in a distributed file system. Then, a subset of data blocks is randomly selected, and the serial clustering algorithm is executed in parallel on the distributed computing framework to generate the component clustering results. In each selected random sample, the number of clusters and the initial centroids are identified with the density-peak-based I-niceDP clustering algorithm and then refined by k-means. Because the random samples are disjoint, traditional consensus functions cannot be used; we therefore propose two new methods, a graph-similarity-based method and a nature-inspired firefly-based algorithm, to integrate the component clustering results into the final ensemble result. The entire clustering process is supported systematically with extensive measures of clusterability and quality evaluation. The methods are validated in a series of experiments on synthetic and real-world datasets. The comprehensive experimental results demonstrate that the proposed methods (1) identify the correct number of clusters by analyzing only a subset of samples and (2) exhibit better scalability, efficiency, and clustering stability.
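A minimal sketch of the disjoint-sample ensemble idea described above, not the authors' implementation: plain k-means stands in for the I-niceDP plus k-means step, and clustering the pooled per-block centroids stands in for the paper's graph-similarity and firefly consensus functions; the block splitting, helper names, and parameter values are illustrative assumptions.

```python
# Hedged sketch: cluster disjoint random sample blocks independently,
# then combine the component results into one ensemble clustering.
import numpy as np
from sklearn.cluster import KMeans


def cluster_sample_blocks(blocks, k):
    """Run a base clusterer independently on each disjoint sample block."""
    results = []
    for block in blocks:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(block)
        results.append((block, labels))
    return results


def ensemble_centroids(results, k):
    """Naive consensus stand-in: pool per-block cluster centroids and
    cluster them again to obtain the final ensemble centroids."""
    centroids = []
    for block, labels in results:
        for c in range(k):
            members = block[labels == c]
            if len(members) > 0:
                centroids.append(members.mean(axis=0))
    pooled = np.vstack(centroids)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(pooled).cluster_centers_


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic data with three well-separated groups.
    data = np.vstack([rng.normal(loc=m, scale=0.3, size=(300, 2)) for m in (0, 3, 6)])
    rng.shuffle(data)
    blocks = np.array_split(data, 5)  # disjoint random sample data blocks
    results = cluster_sample_blocks(blocks, k=3)
    print(ensemble_centroids(results, k=3))
```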