Statistical Significance of Clustering Using Soft Thresholding

Hanwen Huang, Yufeng Liu, Ming Yuan, J. S. Marron

doi:10.1080/10618600.2014.948179

Abstract

Clustering methods have led to a number of important discoveries in bioinformatics and beyond. A major challenge in their use is determining which clusters represent important underlying structure, as opposed to spurious sampling artifacts. This challenge is especially serious, and very few methods are available, when the data are very high in dimension. Statistical significance of clustering (SigClust) is a recently developed cluster evaluation tool for high-dimensional, low-sample-size (HDLSS) data. An important component of the SigClust approach is the very definition of a single cluster as a subset of data sampled from a multivariate Gaussian distribution. The implementation of SigClust requires the estimation of the eigenvalues of the covariance matrix for the null multivariate Gaussian distribution. We show that the original eigenvalue estimation can lead to a test that suffers from severe inflation of Type I error in the important case where there are a few very large eigenvalues. This article addresses this critical challenge using a novel likelihood-based soft thresholding approach to estimate these eigenvalues, which leads to a much improved SigClust. Major improvements in SigClust performance are shown by both mathematical analysis, based on the new notion of the theoretical cluster index (TCI), and extensive simulation studies. Applications to some cancer genomic data further demonstrate the usefulness of these improvements.

Highlights

Clustering methods have been broadly applied in many fields, including biomedical and genetic research.
Clustering is an important example of unsupervised learning, in the sense that no class labels are provided for the analysis.
Given a clustering of the vectors in X, i.e., index sets C1 and C2 that are disjoint and satisfy C1 ∪ C2 = {1, ..., n}, the strength of the clusters can be assessed using the two-means cluster index (CI), which is the sum of the within-class variation divided by the total variation.
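
The cluster index described in the last highlight can be illustrated with a minimal sketch. It assumes that "variation" means the sum of squared Euclidean distances to the relevant mean (the class mean for within-class variation, the overall mean for total variation). The function name cluster_index, the helper names, and the toy data are hypothetical and are not taken from the paper or from any SigClust software. Indices are 0-based here, whereas the text uses {1, ..., n}.

```python
def _mean(rows):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def _sum_sq(rows, center):
    """Sum of squared Euclidean distances from each row to center."""
    return sum((x - c) ** 2 for row in rows for x, c in zip(row, center))


def cluster_index(X, c1, c2):
    """Two-means cluster index: within-class variation / total variation.

    X is a list of vectors; c1 and c2 are disjoint lists of row indices
    that together cover all rows of X (assumed, not checked here).
    """
    total = _sum_sq(X, _mean(X))
    within = 0.0
    for idx in (c1, c2):
        rows = [X[i] for i in idx]
        within += _sum_sq(rows, _mean(rows))
    return within / total


# Toy data: two pairs of points separated along the first coordinate.
X = [[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]]

ci_good = cluster_index(X, [0, 1], [2, 3])  # split along the real gap: 1/17
ci_bad = cluster_index(X, [0, 2], [1, 3])   # split across the gap: 16/17
print(ci_good, ci_bad)
```

A smaller CI means that a larger share of the total variation is explained by the split into two classes, so the first partition (CI = 1/17) is a much stronger clustering of this toy data than the second (CI = 16/17).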

Summary

Introduction

Clustering methods have been broadly applied in many fields, including biomedical and genetic research. They aim to find structure in data by identifying groups that are similar in some sense, and they are a common step in exploratory data analysis. Clustering is an important example of unsupervised learning, in the sense that no class labels are provided for the analysis. Clustering algorithms can produce any desired number of clusters; on some occasions these have yielded important scientific discoveries, but they can also be quite spurious. This motivates some natural cluster evaluation questions such as:



Journal: Journal of Computational and Graphical Statistics
Publication Date: Oct 2, 2015
Citations: 63
License type: cc-by


