Abstract
Hierarchical clustering is a traditional topic in computer science that aims to discover a consistent hierarchy of clusters at different granularities. One of the most important open questions in hierarchical clustering is how to identify the meaningful clustering levels in the hierarchical structure. In this paper, we answer this question from an algorithmic point of view. In particular, we derive a quantitative analysis of the impact of low-level clustering costs on high-level clusters when agglomerative algorithms are run to construct the hierarchy. This analysis enables us to find meaningful clustering levels, which are independent of the clusters hierarchically beneath them. We thus propose a general agglomerative hierarchical clustering framework that automatically constructs meaningful clustering levels. This framework is proven to be applicable to any k-clustering problem in any α-relaxed metric space, in which the strict triangle inequality is relaxed within some constant factor α. To demonstrate the framework, we conduct case studies on the k-median and k-means clustering problems, in both of which our framework achieves a better approximation factor than the state-of-the-art methods. We also extend our framework to handle the data stream clustering problem, which allows only one scan over the whole data set. By incorporating our framework into Guha’s data stream clustering algorithm, the clustering quality is greatly enhanced with only a small extra computation cost incurred. Extensive experiments show that our proposal is superior to distance-based agglomerative hierarchical clustering and data stream clustering algorithms on a variety of data sets.
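To make the notion of an α-relaxed metric space concrete, the following minimal sketch checks a relaxed triangle inequality numerically. It assumes the relaxation takes the common form d(x, z) ≤ α · (d(x, y) + d(y, z)); the abstract states only that the triangle inequality is relaxed within a constant factor α, so this exact form is an assumption. The function names `sq_euclidean` and `smallest_alpha` are hypothetical and chosen for illustration. Squared Euclidean distance, the cost typically used in k-means, is a standard example: it violates the strict triangle inequality but satisfies the relaxed form with α = 2.

```python
import itertools
import random


def sq_euclidean(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def smallest_alpha(points, dist):
    """Smallest alpha >= 1 with dist(x, z) <= alpha * (dist(x, y) + dist(y, z))
    over all ordered triples of the given points.

    Assumed form of the relaxed triangle inequality (not taken from the paper).
    """
    alpha = 1.0
    for x, y, z in itertools.permutations(points, 3):
        detour = dist(x, y) + dist(y, z)
        if detour > 0:  # skip degenerate triples of coincident points
            alpha = max(alpha, dist(x, z) / detour)
    return alpha


# Three collinear, equally spaced points attain the worst case alpha = 2.
collinear = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
worst_case = smallest_alpha(collinear, sq_euclidean)

# On a random sample the observed factor stays within the bound of 2.
random.seed(0)
sample = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(30)]
observed = smallest_alpha(sample, sq_euclidean)
```

Here `worst_case` shows that the strict triangle inequality (α = 1) fails for squared distances, while `observed` illustrates that the factor never exceeds 2 on sampled data.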