Efficient exploratory clustering analyses in large-scale exploration processes

Manuel Fritz, Michael Behringer, Holger Schwarz, Dennis Tschechlov

doi:10.1007/s00778-021-00716-y


Open Access

https://doi.org/10.1007/s00778-021-00716-y


Journal: The VLDB Journal	Publication Date: Nov 29, 2021
Citations: 1	License type: open-access

Affiliation: University of Stuttgart

Abstract

Clustering is a fundamental primitive in manifold applications. In order to achieve valuable results in exploratory clustering analyses, the parameters of the clustering algorithm have to be set appropriately, which is a tremendous pitfall. We observe multiple challenges for large-scale exploration processes. On the one hand, they require specific methods to efficiently explore large parameter search spaces. On the other hand, they often exhibit large runtimes, in particular when large datasets are analyzed using clustering algorithms with super-polynomial runtimes, which need to be executed repeatedly within exploratory clustering analyses. We address these challenges as follows: First, we present LOG-Means and show that it provides estimates for the number of clusters in sublinear time regarding the defined search space, i.e., provably requiring fewer executions of a clustering algorithm than existing methods. Second, we demonstrate how to exploit fundamental characteristics of exploratory clustering analyses in order to significantly accelerate the (repetitive) execution of clustering algorithms on large datasets. Third, we show how these challenges can be tackled at the same time. To the best of our knowledge, this is the first work which simultaneously addresses the above-mentioned challenges. In our comprehensive evaluation, we unveil that our proposed methods significantly outperform state-of-the-art methods, thus especially supporting novice analysts in exploratory clustering analyses in large-scale exploration processes.
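
To make the cost argument about search spaces concrete, here is a minimal sketch. It is not the authors' LOG-Means implementation. It assumes k-means (Lloyd's algorithm with a deterministic farthest-point seeding) as the clustering algorithm and the sum of squared errors (SSE) as the quality measure. It contrasts an exhaustive sweep over all candidate values of k with a simple bisection heuristic that repeatedly halves the interval whose endpoints show the largest SSE ratio. The function names (`demo_points`, `kmeans_sse`, `exhaustive_search`, `bisecting_search`) and the toy data are hypothetical.

```python
import math


def demo_points():
    """Hypothetical toy data: four well-separated groups of five 2-D points."""
    offsets = [(0.0, 0.0), (0.5, 0.0), (-0.5, 0.0), (0.0, 0.5), (0.0, -0.5)]
    centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
    return [(cx + dx, cy + dy) for cx, cy in centers for dx, dy in offsets]


def kmeans_sse(points, k, max_iters=100):
    """Run Lloyd's algorithm with farthest-point seeding and return the SSE."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest away from its nearest existing seed.
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its members (keep it if empty).
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return sum(min(math.dist(p, c) for c in centroids) ** 2 for p in points)


def exhaustive_search(points, k_low, k_high):
    """Baseline: one clustering run per candidate k in [k_low, k_high]."""
    sse_by_k = {k: kmeans_sse(points, k) for k in range(k_low, k_high + 1)}
    return sse_by_k, len(sse_by_k)


def bisecting_search(points, k_low, k_high):
    """Simplified heuristic: bisect the interval with the largest SSE ratio."""
    sse = {k_low: kmeans_sse(points, k_low), k_high: kmeans_sse(points, k_high)}
    while True:
        ks = sorted(sse)
        # Adjacent evaluated pair (lo, hi) with the steepest relative SSE drop.
        lo, hi = max(
            zip(ks, ks[1:]),
            key=lambda pair: sse[pair[0]] / max(sse[pair[1]], 1e-12),
        )
        if hi - lo == 1:
            return hi, len(sse)  # estimate for k and number of clustering runs
        mid = (lo + hi) // 2
        sse[mid] = kmeans_sse(points, mid)
```

Calling `bisecting_search(demo_points(), 2, 16)` returns an estimate together with the number of clustering runs it needed, which can be compared against the run count reported by `exhaustive_search` for the same range. On real data, both the seeding strategy and the stopping rule would need more care than this sketch provides.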

Highlights

Clustering is a fundamental primitive for exploratory tasks.
Since centroid-based clustering algorithms are at the core of the above-mentioned estimation methods, we focus on their procedure in the following.
We showed in our previous work that Delta Initialization achieves even more valuable clustering results than initializing via kMeans [24], since it exploits previous clustering results, where the positions of the centroids are optima, which only emerge after the execution of the clustering algorithm.
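
The idea of reusing earlier clustering results can be sketched as follows. This is only an illustration of warm-starting under stated assumptions, not the Delta Initialization procedure from [24]: it assumes that the converged centroids of a previous run with fewer clusters are kept as seeds and that the missing seeds are added by a farthest-point rule. The names `extend_seeds`, `lloyd`, and `demo_points` and the toy data are hypothetical.

```python
import math


def demo_points():
    """Hypothetical toy data: four well-separated groups of five 2-D points."""
    offsets = [(0.0, 0.0), (0.5, 0.0), (-0.5, 0.0), (0.0, 0.5), (0.0, -0.5)]
    centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
    return [(cx + dx, cy + dy) for cx, cy in centers for dx, dy in offsets]


def extend_seeds(points, seeds, k):
    """Keep the given seeds and add farthest points until there are k seeds."""
    seeds = list(seeds)
    while len(seeds) < k:
        seeds.append(
            max(points, key=lambda p: min(math.dist(p, s) for s in seeds))
        )
    return seeds


def lloyd(points, centroids, max_iters=100):
    """Run Lloyd's iterations from the given centroids; return (centroids, sse)."""
    k = len(centroids)
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its members (keep it if empty).
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    sse = sum(min(math.dist(p, c) for c in centroids) ** 2 for p in points)
    return centroids, sse


points = demo_points()
# Cold start for k = 3: only the first data point is given as a seed.
centroids_3, sse_3 = lloyd(points, extend_seeds(points, [points[0]], 3))
# Warm start for k = 4: reuse the converged centroids of the k = 3 run.
centroids_4, sse_4 = lloyd(points, extend_seeds(points, centroids_3, 4))
```

In this sketch, the run for k = 4 starts from centroids that already sit at converged positions of the earlier run, so only the newly added seed has to find its place. Whether this matches the behavior reported for Delta Initialization depends on details that the text above does not spell out.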

Summary

Introduction

Clustering is a fundamental primitive for exploratory tasks. Manifold application domains rely on clustering techniques: In computer vision, image segmentation tasks can be formulated as a clustering problem [21,48]. For business purposes, clustering may be used for grouping customers, for workforce management or for planning tasks [32,45]. Jain identified three main general purposes of clustering throughout these and many more application domains, which emphasize the exploratory power of clustering analyses [37]: (i) Assessing the structure of the data. The goal is to exploit clustering to gain a better understanding of data, to generate hypotheses, or to detect anomalies. (ii) Grouping entities. Clustering aims to group similar entities into the same cluster, so that previously unseen entities can be assigned to a specific cluster. (iii) Compressing data, i.e., using the clusters and their information as a summary of the data for further steps.

Similar Papers

LOG-Means
Manuel Fritz ... Michael Behringer
Proceedings of the VLDB Endowment | VOL. 13
01 Aug 2020

Evolutionary Algorithm Based Techniques to Handle Big Data
Ghosh Sanchita ... Desarkar Anindita
01 Jan 2015

TIGRIS: An Informed Sampling-based Algorithm for Informative Path Planning
Brady Moon ... Satrajit Chatterjee
23 Oct 2022

Multi-agent framework for real-time processing of large and dynamic search spaces
John Korah ... Eugene Santos
26 Mar 2012
