Abstract

Clustering methods have led to a number of important discoveries in bioinformatics and beyond. A major challenge in their use is determining which clusters represent important underlying structure, as opposed to spurious sampling artifacts. This challenge is especially serious, and very few methods are available, when the data are very high in dimension. Statistical significance of clustering (SigClust) is a recently developed cluster evaluation tool for high-dimensional, low-sample-size (HDLSS) data. An important component of the SigClust approach is the very definition of a single cluster as a subset of data sampled from a multivariate Gaussian distribution. The implementation of SigClust requires estimating the eigenvalues of the covariance matrix of the null multivariate Gaussian distribution. We show that the original eigenvalue estimation can lead to a test that suffers from severe inflation of Type I error in the important case where there are a few very large eigenvalues. This article addresses this critical challenge using a novel likelihood-based soft thresholding approach to estimate these eigenvalues, which leads to a much improved SigClust. Major improvements in SigClust performance are shown by both mathematical analysis, based on the new notion of theoretical cluster index (TCI), and extensive simulation studies. Applications to some cancer genomic data further demonstrate the usefulness of these improvements.

Highlights

  • Clustering methods have been broadly applied in many fields including biomedical and genetic research

  • Clustering is an important example of unsupervised learning, in the sense that there are no class labels provided for the analysis

  • Given a clustering of the vectors in X, i.e., index sets C1 and C2 where C1 ∪ C2 = {1, ..., n} and C1 and C2 are disjoint, the strength of the clusters can be assessed using the 2-means cluster index (CI), which is the sum of the within-class variations divided by the total variation
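
The cluster index described in the last highlight can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes that "variation" means the sum of squared Euclidean distances to the relevant mean (the class mean for the within-class terms, the overall mean for the total), and it uses 0-based indices rather than {1, ..., n}. The function name `cluster_index` and the toy data `points` are hypothetical.

```python
def cluster_index(X, C1, C2):
    """Two-means cluster index: within-class variation / total variation.

    X  : list of n data vectors (each a list of d numbers)
    C1 : indices of the first cluster (0-based, an assumption of this sketch)
    C2 : indices of the second cluster; C1 and C2 must partition range(n)
    """
    if not C1 or not C2:
        raise ValueError("both clusters must be non-empty")
    if sorted(list(C1) + list(C2)) != list(range(len(X))):
        raise ValueError("C1 and C2 must be disjoint and cover every index")

    d = len(X[0])

    def mean(idx):
        # Coordinate-wise mean of the vectors selected by idx.
        return [sum(X[i][j] for i in idx) / len(idx) for j in range(d)]

    def sum_sq(idx, center):
        # Sum of squared Euclidean distances from the selected vectors to center.
        return sum((X[i][j] - center[j]) ** 2 for i in idx for j in range(d))

    within = sum_sq(C1, mean(C1)) + sum_sq(C2, mean(C2))
    everyone = list(C1) + list(C2)
    total = sum_sq(everyone, mean(everyone))
    # Note: total is zero only if all vectors coincide; that case is not handled.
    return within / total


# Toy data: two tight pairs of points, far apart along the first coordinate.
points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]

good = cluster_index(points, [0, 1], [2, 3])  # split along the large gap
bad = cluster_index(points, [0, 2], [1, 3])   # split along the small gap
```

On this toy data the split along the large gap gives a CI of 4/104 (about 0.04), while the other split gives 100/104 (about 0.96). Under this definition the CI lies between 0 and 1, and smaller values indicate a stronger separation into two clusters.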


Summary

Introduction

Clustering methods have been broadly applied in many fields, including biomedical and genetic research. They aim to find structure in data by identifying groups that are similar in some sense. Clustering is a common step in exploratory data analysis and an important example of unsupervised learning, in the sense that no class labels are provided for the analysis. Clustering algorithms can produce any desired number of clusters; on some occasions these clusters have yielded important scientific discoveries, but they can also be quite spurious. This motivates some natural cluster evaluation questions such as:

