Abstract
HDBSCAN*, a state-of-the-art density-based hierarchical clustering method, produces a hierarchical organization of clusters in a dataset with respect to a parameter $m_{pts}$. While a small change in $m_{pts}$ typically leads to a small change in the clustering structure, choosing a "good" $m_{pts}$ value can be challenging: depending on the data distribution, a high or low $m_{pts}$ value may be more appropriate, and certain clusters may reveal themselves at different values. To explore results for a range of $m_{pts}$ values, one has to run HDBSCAN* for each value independently, which can be computationally impractical. In this paper, we propose an approach to efficiently compute all HDBSCAN* hierarchies for a range of $m_{pts}$ values by building upon results from computational geometry to replace HDBSCAN*'s complete graph with a smaller equivalent graph. An experimental evaluation shows that our approach can obtain over one hundred hierarchies for a computational cost equivalent to running HDBSCAN* about twice, which corresponds to a speedup of more than 60 times compared to running HDBSCAN* independently that many times. We also propose a series of visualizations that allow users to analyze a collection of hierarchies for a range of $m_{pts}$ values, along with case studies that illustrate how these analyses are performed.
Highlights
The discovery of groups within datasets plays an important role in the exploration and analysis of data.
For the EMST, there are known results from computational geometry that relate the EMST to the Delaunay Triangulation (DT), the Gabriel Graph (GG) and the Relative Neighborhood Graph (RNG) in the following way [32]: $\mathrm{EMST} \subseteq \mathrm{RNG} \subseteq \mathrm{GG} \subseteq \mathrm{DT}$ (3)
In this paper we presented RNG-HDBSCAN*, an efficient strategy for computing multiple density-based clustering hierarchies.
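The containment chain between the EMST, the RNG and the GG quoted in the highlights can be checked numerically on a small point set. The following is a minimal sketch, not the paper's implementation: it uses plain Euclidean distance (not HDBSCAN*'s density-based distance), brute-force constructions that are cubic in the number of points, and hypothetical function names (`emst_edges`, `rng_edges`, `gabriel_edges`) chosen for illustration. The Delaunay Triangulation is left out to keep the example free of external libraries.

```python
import math
import random
from itertools import combinations


def emst_edges(points):
    """Prim's algorithm on the complete Euclidean graph.

    Returns the tree as a set of index pairs (i, j) with i < j.
    """
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known connection to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = set()
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.add((min(u, parent[u]), max(u, parent[u])))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
                    parent[v] = u
    return edges


def rng_edges(points):
    """(i, j) is an RNG edge if no third point k is closer to both
    i and j than they are to each other."""
    n = len(points)
    edges = set()
    for i, j in combinations(range(n), 2):
        d_ij = math.dist(points[i], points[j])
        blocked = any(
            max(math.dist(points[i], points[k]),
                math.dist(points[j], points[k])) < d_ij
            for k in range(n) if k != i and k != j
        )
        if not blocked:
            edges.add((i, j))
    return edges


def gabriel_edges(points):
    """(i, j) is a Gabriel edge if no third point k lies strictly inside
    the ball whose diameter is the segment from i to j."""
    n = len(points)
    edges = set()
    for i, j in combinations(range(n), 2):
        d_ij = math.dist(points[i], points[j])
        blocked = any(
            math.dist(points[i], points[k]) ** 2
            + math.dist(points[j], points[k]) ** 2 < d_ij ** 2
            for k in range(n) if k != i and k != j
        )
        if not blocked:
            edges.add((i, j))
    return edges


rnd = random.Random(0)  # fixed seed so the run is repeatable
points = [(rnd.random(), rnd.random()) for _ in range(30)]
emst, rng, gg = emst_edges(points), rng_edges(points), gabriel_edges(points)

# Each graph's edge set should be contained in the next one.
print(emst <= rng <= gg)
print(len(emst), len(rng), len(gg))
```

The first line printed should be `True`. The edge counts on the second line give a sense of how much smaller these graphs are than the complete graph, which has 435 edges for 30 points.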
Summary
The discovery of groups within datasets plays an important role in the exploration and analysis of data. HDBSCAN*, the current state of the art among density-based clustering methods, computes a hierarchy of nested clusters, representing clusters at different density levels. It generalizes and improves several aspects of previous algorithms, and allows for a comprehensive framework for cluster analysis, visualization, and unsupervised outlier detection [9]. HDBSCAN* stands out for its ability to detect arbitrary-shaped clusters and noise, as well as for building a hierarchical organization of cluster structures rather than finding a single flat partitioning of the data. It can be considered a practical and theoretical generalization of its predecessors (DBSCAN and OPTICS). These characteristics allow us to devise and prove the correctness of the strategy proposed in this work.
More From: IEEE Transactions on Knowledge and Data Engineering