Abstract
Clustering is an important task in data mining that has become more challenging due to the ever-increasing size of available datasets. To cope with these big data scenarios, a high-performance clustering approach is required. Sparse grid clustering is a density-based clustering method that uses a sparse grid density estimation as its central building block. The underlying density estimation approach can detect clusters with non-convex shapes and does not require the number of clusters to be known in advance. In this work, we introduce a new distributed and performance-portable variant of the sparse grid clustering algorithm that is suited for big data settings. Our compute kernels are implemented in OpenCL to enable portability across a wide range of architectures. For distributed environments, we added a manager–worker scheme implemented using MPI. In experiments on two supercomputers, Piz Daint and Hazel Hen, with up to 100 million data points in a ten-dimensional dataset, we demonstrate the performance and scalability of our approach. The dataset with 100 million data points was clustered in 1198 s using 128 nodes of Piz Daint, which translates to an overall performance of 352 TFLOPS. At the node level, we provide results for two GPUs, Nvidia's Tesla P100 and the AMD FirePro W8100, and one processor-based platform using Intel Xeon E5-2680v3 processors. In these experiments, we achieved between 43% and 66% of the peak performance across all compute kernels and devices, demonstrating the performance portability of our approach.
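To illustrate the kind of manager–worker distribution the abstract refers to, the following is a minimal MPI sketch in C++. The chunk count, message tags, and the stand-in computation are illustrative assumptions; the paper's actual task granularity and message layout are not specified here, and the real workers would run the OpenCL density estimation kernels instead.

// Minimal manager-worker sketch over MPI (illustrative assumptions only).
#include <mpi.h>

static const int TAG_WORK = 1, TAG_RESULT = 2, TAG_STOP = 3;

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  const int num_chunks = 64;  // assumption: dataset split into 64 work chunks

  if (rank == 0) {  // manager: hands out chunk indices on demand
    int next = 0, active = size - 1;
    while (active > 0) {
      double result;
      MPI_Status st;
      // A worker signals readiness by sending its previous (or dummy) result.
      MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, TAG_RESULT,
               MPI_COMM_WORLD, &st);
      if (next < num_chunks) {
        MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
        ++next;
      } else {
        int dummy = -1;  // no work left: tell this worker to stop
        MPI_Send(&dummy, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP, MPI_COMM_WORLD);
        --active;
      }
    }
  } else {  // worker: request work and process chunks until stopped
    double result = 0.0;  // first send is only a "ready" message
    while (true) {
      MPI_Send(&result, 1, MPI_DOUBLE, 0, TAG_RESULT, MPI_COMM_WORLD);
      int chunk;
      MPI_Status st;
      MPI_Recv(&chunk, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
      if (st.MPI_TAG == TAG_STOP) break;
      result = static_cast<double>(chunk);  // stand-in for the real kernel work
    }
  }
  MPI_Finalize();
  return 0;
}

Dynamic, on-demand assignment like this keeps all workers busy even when chunks take different amounts of time, which matters at the scale of 128 nodes reported above.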
Highlights
In data mining, cluster analysis partitions a dataset according to a given measure of similarity. The partitions obtained as a result of the clustering process are called clusters.
We introduce a new distributed and performance-portable variant of the sparse grid clustering algorithm.
Due to the underlying method and the high-performance distributed approach, we show that sparse grid clustering is well-suited for large datasets.
Summary
Cluster analysis partitions a dataset according to a given measure of similarity. The partitions obtained as a result of the clustering process are called clusters. Mapping clustering approaches to modern hardware platforms such as graphics processing units (GPUs) requires new parallel formulations. The classic k-means algorithm iteratively improves an initial guess of the cluster centers [1]. Efficient variants of the k-means algorithm have been proposed, e.g., using domain partitioning through k-d trees [2] or a more careful selection of the initial cluster centers [3]. K-means requires the number of clusters to be known in advance, which is not always possible. Moreover, in contrast to many alternatives, k-means cannot detect clusters with non-convex shapes.
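As a reference point for the k-means baseline discussed above, here is a minimal one-dimensional k-means sketch in C++. The data, the number of clusters k, and the initial centers are illustrative assumptions; the sketch shows only the iterative assignment-and-update refinement of the centers, not the paper's sparse grid clustering.

// Minimal 1-D k-means sketch (illustrative data and initial centers).
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
  std::vector<double> data = {1.0, 1.2, 0.8, 5.0, 5.3, 4.9};  // assumption
  std::vector<double> centers = {0.0, 6.0};  // initial guess (assumption)
  const int k = 2;

  for (int iter = 0; iter < 100; ++iter) {
    // Assignment step: attach each point to its nearest center.
    std::vector<double> sum(k, 0.0);
    std::vector<int> count(k, 0);
    for (double x : data) {
      int best = 0;
      for (int c = 1; c < k; ++c)
        if (std::fabs(x - centers[c]) < std::fabs(x - centers[best])) best = c;
      sum[best] += x;
      ++count[best];
    }
    // Update step: move each center to the mean of its assigned points.
    double shift = 0.0;
    for (int c = 0; c < k; ++c) {
      if (count[c] == 0) continue;  // empty cluster: keep the old center
      double updated = sum[c] / count[c];
      shift += std::fabs(updated - centers[c]);
      centers[c] = updated;
    }
    if (shift < 1e-9) break;  // centers stopped moving: converged
  }
  std::printf("centers: %f %f\n", centers[0], centers[1]);
  return 0;
}

Because each point is assigned to its nearest center, the resulting clusters are intersections of half-spaces and hence convex, which is exactly the limitation that density-based methods such as sparse grid clustering avoid.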