Abstract
Clustering is a fundamental procedure for extracting meaningful insights from a data set. The quality of the resulting clusters depends largely on a correct estimate of their number, K∗, which many clustering algorithms require as an input parameter. Very few techniques detect K∗ automatically, and those that do usually rely on cluster validity indexes, which are computationally expensive. Here, we present a new algorithm that yields an accurate estimate of K∗ without partitioning the data into clusters. This makes the algorithm particularly efficient for large-scale data sets in terms of both time and space complexity. Specifically, the algorithm highlights the block structure implicitly present in the similarity matrix and identifies K∗ with the number of blocks in the matrix. We test the algorithm on synthetic data sets with and without a hierarchical organization of elements. We explore a wide range of K∗ and show that the proposed algorithm identifies K∗ effectively, and more accurately than existing methods based on standard internal validity indexes, with a large advantage in computation time and memory storage. We also discuss the application of the new algorithm to the de-clustering of instrumental earthquake catalogs, a procedure aimed at identifying the level of background seismic activity used in seismic hazard assessment.
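The idea of reading K∗ off the block structure of a similarity matrix can be illustrated with a toy example. The sketch below is not the algorithm proposed in the paper, whose details the abstract does not give. It assumes a binary similarity (two points are similar when their Euclidean distance is below a threshold) and counts the diagonal blocks of that matrix as connected components, using union-find so the full matrix is never stored. The names `count_blocks`, `make_blobs`, and the `radius` parameter are hypothetical and chosen only for this illustration.

```python
import math
import random


def count_blocks(points, radius):
    """Count the blocks of a thresholded similarity matrix.

    Assumption for this sketch: points i and j are 'similar' when their
    Euclidean distance is below `radius`. Each block of the resulting
    binary matrix is a connected component of that relation.
    """
    n = len(points)
    parent = list(range(n))  # union-find forest, one tree per block

    def find(i):
        # Follow parent links to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    blocks = n  # start with every point in its own block
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) < radius:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[ri] = rj  # merge two blocks
                    blocks -= 1
    return blocks


def make_blobs(k, per_cluster, spread, seed=0):
    """Synthetic 2D data: k square blobs with centers 10 units apart."""
    rng = random.Random(seed)
    points = []
    for c in range(k):
        cx, cy = 10.0 * c, 0.0
        for _ in range(per_cluster):
            points.append((cx + rng.uniform(-spread, spread),
                           cy + rng.uniform(-spread, spread)))
    rng.shuffle(points)  # element order carries no cluster information
    return points


if __name__ == "__main__":
    data = make_blobs(k=4, per_cluster=30, spread=1.0)
    print(count_blocks(data, radius=3.0))
```

With `spread=1.0` every pair inside a blob is closer than 3 units and every pair from different blobs is at least 8 units apart, so the count equals the number of blobs. On real data the threshold is the hard part, and this pairwise loop scales quadratically in the number of points, so the sketch says nothing about the efficiency claimed for the actual method.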
More From: Physica A: Statistical Mechanics and its Applications