Parallel and scalable Dunn Index for the validation of big data clusters

Chiheb-Eddine Ben Ncir, Abdallah Hamza, Waad Bouaguel

doi:10.1016/j.parco.2021.102751

Abstract

Parallelizing data clustering algorithms has attracted the interest of many researchers over the past few years. Many efficient parallel algorithms have been proposed to build partitionings over huge volumes of data. The effectiveness of these algorithms is attributed to the distribution of data among a cluster of nodes and to the parallel computation models. Despite the effectiveness of parallel models in dealing with increasing volumes of data, little work has been done on the validation of big clusters. To deal with this issue, we propose a parallel and scalable model, referred to as S-DI (Scalable Dunn Index), to compute the Dunn Index measure for the internal validation of clustering results. Rather than computing the Dunn Index on a single machine during the clustering validation process, the proposed measure is computed by distributing the partitioning among a cluster of nodes using a customized parallel model under the Apache Spark framework. The proposed S-DI is also enhanced by a Sketch and Validate sampling technique, which aims to approximate the Dunn Index value by using a small representative data sample. Experiments on simulated and real datasets showed good scalability of the proposed measure and reliable validation compared to other existing measures when handling large-scale data.
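
For reference, the Dunn Index is conventionally defined as the ratio of the smallest distance between points in different clusters to the largest diameter of any single cluster; higher values indicate compact, well-separated clusters. The single-machine Python sketch below illustrates that definition together with a simple sampling-based approximation. It is not the S-DI implementation, which runs on Apache Spark. The function names `dunn_index` and `sampled_dunn_index`, the choice of Euclidean distance, and the per-cluster random sampling scheme are assumptions made here for illustration only.

```python
import math
import random
from itertools import combinations


def _group_by_label(points, labels):
    """Group points into lists keyed by their cluster label."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    return clusters


def dunn_index(points, labels):
    """Dunn Index: min inter-cluster distance / max intra-cluster diameter.

    Assumes Euclidean distance, nearest-pair separation between clusters,
    and farthest-pair diameter within a cluster.
    """
    clusters = _group_by_label(points, labels)
    if len(clusters) < 2:
        raise ValueError("the Dunn Index needs at least two clusters")

    # Largest distance between two points of the same cluster.
    max_diameter = max(
        (math.dist(a, b)
         for members in clusters.values()
         for a, b in combinations(members, 2)),
        default=0.0,
    )
    # Smallest distance between two points of different clusters.
    min_separation = min(
        math.dist(a, b)
        for first, second in combinations(clusters.values(), 2)
        for a in first
        for b in second
    )
    if max_diameter == 0.0:
        return math.inf  # every cluster is a single point (or duplicates)
    return min_separation / max_diameter


def sampled_dunn_index(points, labels, fraction, seed=0):
    """Approximate the Dunn Index on a random sample drawn from each cluster.

    Illustrative stand-in for a sampling-based estimate; the sampling
    scheme (uniform, per cluster, at least two points) is an assumption.
    """
    rng = random.Random(seed)
    sample_points, sample_labels = [], []
    for label, members in _group_by_label(points, labels).items():
        size = min(len(members), max(2, round(fraction * len(members))))
        for point in rng.sample(members, size):
            sample_points.append(point)
            sample_labels.append(label)
    return dunn_index(sample_points, sample_labels)


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
labels = [0, 0, 1, 1]
print(dunn_index(points, labels))  # 5.0: separation 5 over diameter 1
```

The exact computation compares every pair of points, so its cost grows quadratically with the dataset size. That cost is what motivates both distributing the work across nodes and estimating the index from a sample, as the abstract describes.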
