Abstract

Clustering is an important task in data mining that has become more challenging due to the ever-increasing size of available datasets. To cope with these big data scenarios, a high-performance clustering approach is required. Sparse grid clustering is a density-based clustering method that uses a sparse grid density estimation as its central building block. The underlying density estimation approach enables the detection of clusters with non-convex shapes and without a predetermined number of clusters. In this work, we introduce a new distributed and performance-portable variant of the sparse grid clustering algorithm that is suited for big data settings. Our compute kernels are implemented in OpenCL to enable portability across a wide range of architectures. For distributed environments, we added a manager–worker scheme implemented with MPI. In experiments on two supercomputers, Piz Daint and Hazel Hen, with up to 100 million data points in a ten-dimensional dataset, we show the performance and scalability of our approach. The dataset with 100 million data points was clustered in 1198 s using 128 nodes of Piz Daint, which translates to an overall performance of 352 TFLOPS. At the node level, we provide results for two GPUs, Nvidia's Tesla P100 and the AMD FirePro W8100, and one processor-based platform that uses Intel Xeon E5-2680v3 processors. In these experiments, we achieved between 43% and 66% of peak performance across all compute kernels and devices, demonstrating the performance portability of our approach.
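The manager–worker distribution mentioned above is a standard MPI pattern. As a rough illustration only (not the authors' implementation), the sketch below shows a manager rank handing out chunk indices to worker ranks on demand; the tag names, the chunk count, and the process_chunk stub are invented for this example, and in the actual algorithm each chunk of work would be offloaded to an OpenCL compute kernel.

```c
/* Hedged sketch of an on-demand MPI manager-worker scheme.
 * All names here are illustrative assumptions, not the paper's API. */
#include <mpi.h>

enum { TAG_REQUEST = 0, TAG_WORK = 1, TAG_STOP = 2 };

/* Placeholder: the real code would launch an OpenCL kernel on this chunk. */
static void process_chunk(int chunk) { (void)chunk; }

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int num_chunks = 1024; /* illustrative total number of work units */

    if (rank == 0) {
        /* Manager: answer each work request with the next chunk index;
         * once all chunks are handed out, tell each worker to stop. */
        int next = 0, stopped = 0, dummy;
        MPI_Status st;
        while (stopped < size - 1) {
            MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQUEST,
                     MPI_COMM_WORLD, &st);
            if (next < num_chunks) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
                ++next;
            } else {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                         MPI_COMM_WORLD);
                ++stopped;
            }
        }
    } else {
        /* Worker: request work until the manager signals stop. */
        int chunk, dummy = 0;
        MPI_Status st;
        for (;;) {
            MPI_Send(&dummy, 1, MPI_INT, 0, TAG_REQUEST, MPI_COMM_WORLD);
            MPI_Recv(&chunk, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            process_chunk(chunk);
        }
    }
    MPI_Finalize();
    return 0;
}
```

A pull-based scheme like this keeps all workers busy even when chunks take unequal time, which matters when heterogeneous nodes process data at different rates.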

Highlights

  • In data mining, cluster analysis partitions a dataset according to a given measure of similarity. The partitions obtained as a result of the clustering process are called clusters

  • We introduce a new distributed and performance-portable variant of the sparse grid clustering algorithm

  • Due to the underlying method and the high-performance distributed approach, we show that sparse grid clustering is well-suited for large datasets

Introduction

Cluster analysis partitions a dataset according to a given measure of similarity. The partitions obtained as a result of the clustering process are called clusters. Mapping clustering approaches to modern hardware platforms such as graphics processing units (GPUs) requires new parallel approaches. The classic k-means algorithm iteratively improves an initial guess of cluster centers [1]. Efficient variants of the k-means algorithm have been proposed, e.g., by using domain partitioning through k-d-trees [2] or by a more careful selection of the initial cluster centers [3]. K-means requires the number of clusters to be known in advance, which is not always possible. Moreover, in contrast to many alternatives, k-means cannot detect clusters with non-convex shapes.
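For concreteness, the following minimal C sketch (an illustration under stated assumptions, not code from the paper or its references) shows one iteration of the classic Lloyd formulation of k-means mentioned above: each point is assigned to its nearest center, and each center is then moved to the mean of its assigned points. The row-major data layout and the name kmeans_step are illustrative.

```c
/* One Lloyd iteration of k-means: nearest-center assignment followed by
 * a mean update. Layout and names are assumptions for this sketch.
 * x: n points in d dimensions (row-major); c: k centers in d dimensions. */
#include <float.h>
#include <stdlib.h>

static void kmeans_step(const double *x, size_t n, size_t d,
                        double *c, size_t k) {
    double *sum = calloc(k * d, sizeof(double)); /* per-center coordinate sums */
    size_t *cnt = calloc(k, sizeof(size_t));     /* per-center point counts */

    for (size_t i = 0; i < n; ++i) {
        /* Find the center nearest to point i (squared Euclidean distance). */
        size_t best = 0;
        double best_dist = DBL_MAX;
        for (size_t j = 0; j < k; ++j) {
            double dist = 0.0;
            for (size_t t = 0; t < d; ++t) {
                double diff = x[i * d + t] - c[j * d + t];
                dist += diff * diff;
            }
            if (dist < best_dist) { best_dist = dist; best = j; }
        }
        /* Accumulate the point into its assigned center's running mean. */
        for (size_t t = 0; t < d; ++t) sum[best * d + t] += x[i * d + t];
        ++cnt[best];
    }

    /* Move each non-empty center to the mean of its assigned points. */
    for (size_t j = 0; j < k; ++j)
        if (cnt[j] > 0)
            for (size_t t = 0; t < d; ++t)
                c[j * d + t] = sum[j * d + t] / (double)cnt[j];

    free(sum);
    free(cnt);
}
```

Because every point is assigned to its nearest center, each resulting cluster is a Voronoi cell and therefore convex; this is why k-means cannot recover non-convex cluster shapes, in contrast to the density-based approach introduced in this work.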

