Abstract

The increase in data volume is challenging the suitability of non-distributed and non-scalable algorithms, despite advancements in hardware. An example of this challenge is clustering. Because optimal clustering algorithms scale poorly with increased data volume or are intrinsically non-distributed, accurate clustering of large datasets is increasingly resource-heavy, relying on substantial and expensive compute nodes. This scenario forces users to choose between accuracy and scalability. In this work, we introduce HiErArchical Data Splitting and Stitching (HEADSS), a Python package designed to facilitate clustering at scale. By automating the splitting and stitching, it enables repeatable handling and removal of edge effects. We implement HEADSS in conjunction with HDBSCAN, where we achieve orders of magnitude reduction in single-node memory requirements for both non-distributed and distributed implementations, with the latter offering similar order-of-magnitude reductions in total run times while recovering analogous accuracy. Furthermore, our method establishes a hierarchy of features by using a subset of clustering features to split the data. Source code and examples are available at https://github.com/D-Crake/HEADSS.
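The split-and-stitch idea can be sketched in miniature: partition the data into spatial regions that overlap by a margin, cluster each region independently, then stitch by keeping each point only from the region whose non-overlapping core contains it, which discards the duplicated edge points where boundary effects occur. The function names, the 1-D setting, and the fixed-fraction overlap scheme below are illustrative assumptions, not the HEADSS API.

```python
# Illustrative sketch of overlapping splitting and stitching in 1-D.
# Names and overlap scheme are hypothetical, not the actual HEADSS API.

def split(points, n_regions=4, overlap=0.1):
    """Split 1-D points into overlapping regions.

    Each region is widened by `overlap` (a fraction of the region width)
    on each side, so a cluster straddling a boundary is seen whole by at
    least one region before clustering is run per region.
    """
    lo, hi = min(points), max(points)
    width = (hi - lo) / n_regions
    regions = []
    for i in range(n_regions):
        start = lo + i * width - overlap * width
        end = lo + (i + 1) * width + overlap * width
        regions.append([p for p in points if start <= p <= end])
    return regions

def stitch(regions, points, n_regions=4):
    """Keep each point only from the region whose core contains it,
    discarding the overlap duplicates where edge effects arise."""
    lo, hi = min(points), max(points)
    width = (hi - lo) / n_regions
    stitched = []
    for i, region in enumerate(regions):
        core_lo = lo + i * width
        core_hi = lo + (i + 1) * width
        for p in region:
            # Half-open cores partition the range without double
            # counting; the global maximum belongs to the last core.
            if core_lo <= p < core_hi or (i == n_regions - 1 and p == hi):
                stitched.append(p)
    return sorted(stitched)

data = [0.05 * k for k in range(100)]  # toy 1-D dataset
regions = split(data)
result = stitch(regions, data)
assert result == sorted(data)  # every point recovered exactly once
```

In the full method each region would be clustered (e.g. with HDBSCAN) before stitching, so that only cluster labels assigned well inside a region's core survive, removing the boundary artefacts that per-region clustering introduces.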
