Abstract

Estimating the number of clusters is one of the most difficult problems in cluster analysis. Most previous approaches require knowing the data matrix and may not work when only a Euclidean distance matrix is available. Other approaches also suffer from the curse of dimensionality and work poorly in high dimension. In this article, we develop a new statistic, called the GUD statistic, based on the idea of the Gap method, but use the determinant of the pooled within-group scatter matrix instead of the within-cluster sum of squared distances. Some theory is developed to show this statistic can work well when only the Euclidean distance matrix is known. More generally, this statistic can even work for any dissimilarity matrix that satisfies some properties. We also propose a modification for high-dimensional datasets, called the R-GUD statistic, which can give a robust estimation in high-dimensional settings. The simulation shows our method needs less information but is generally found to be more accurate and robust than other methods considered in the study, especially in many difficult settings.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.