Abstract

In various disciplines, hierarchical clustering (HC) has been an effective tool for data analysis due to its ability to summarize hierarchical structures of data in an intuitive and interpretable manner. A run of HC requires multiple iterations, each of which needs to compute and update the pairwise distances between all intermediate clusters. This makes the exact algorithm for HC inevitably suffer from quadratic time and space complexities. To address large-scale data, various approximate/parallel algorithms have been proposed to reduce the computational cost of HC. However, such algorithms still rely on conventional linkage methods (such as single, centroid, average, complete, or Ward's) for defining pairwise distances, mostly focusing on the approximation/parallelization of linkage computations. Given that the choice of linkage profoundly affects not only the quality but also the efficiency of HC, we propose a new linkage method named NC-link and design an exact algorithm for NC-link-based HC. To guarantee the exactness, the proposed algorithm maintains the quadratic nature in time complexity but exhibits only linear space complexity, thereby allowing us to address million-object data on a personal computer. To underpin the extensibility of our approach, we showthat the algorithmic nature of NC-link enables single instruction multiple data (SIMD) parallelization and subquadratic-time approximation of HC. To verify our proposal, we thoroughly tested it with a number of large-scale real and synthetic data sets. In terms of efficiency, NC-link allowed us to perform HC substantially more space efficiently or faster than conventional methods: compared with average and complete linkages, using NC-link incurred only 0.7%-1.75% of the memory usage, and the NC-link-based implementation delivered speedups of approximately 3.5 times over the centroid and Ward's linkages. With regard to clustering quality, the proposed method was able to retrieve hierarchical structures from input data as faithfully as in the popular average and centroid linkage methods. We anticipate that the existing approximation/parallel algorithms will be able to benefit from adopting NC-link as their linkage method for obtaining better clustering results and reduced time and space demands.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call