Abstract

Background

Cluster analysis has been widely applied for investigating structure in bio-molecular data. A drawback of most clustering algorithms is that they cannot automatically detect the "natural" number of clusters underlying the data, and in many cases we do not have enough a priori biological knowledge to evaluate either the number of clusters or their validity. Several methods based on the concept of stability have recently been proposed to estimate the "optimal" number of clusters. Despite their successful application to the analysis of complex bio-molecular data, two major problems remain: assessing the statistical significance of the discovered clustering solutions, and detecting multiple structures simultaneously present in high-dimensional bio-molecular data.

Results

We propose a stability method based on randomized maps that exploits the high dimensionality and relatively low cardinality that characterize bio-molecular data. The method selects subsets of randomized linear combinations of the input variables and uses stability indices based on the overall distribution of similarity measures between multiple pairs of clusterings performed on the randomly projected data. A χ²-based statistical test is proposed to assess the significance of the clustering solutions and to detect significant and, if possible, multi-level structures simultaneously present in the data (e.g. hierarchical structures).

Conclusion

The experimental results show that our model order selection methods are competitive with other state-of-the-art stability-based algorithms and are able to detect multiple levels of structure underlying both synthetic and gene expression data.
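The stability idea summarized in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the Gaussian projection matrix, the plain k-means with farthest-point initialisation, the pairwise agreement score (a Rand-style index) and the use of a simple mean instead of the full distribution of similarities are all assumptions made here for illustration, and every function name is hypothetical.

```python
import math
import random


def random_projection(data, target_dim, rng):
    """Project each row of `data` to `target_dim` dimensions with a
    Gaussian random matrix (an assumed choice of randomized map)."""
    source_dim = len(data[0])
    scale = 1.0 / math.sqrt(target_dim)
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(source_dim)]
              for _ in range(target_dim)]
    return [[sum(w * x for w, x in zip(row, point)) for row in matrix]
            for point in data]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_labels(points, k, iterations=20):
    """Plain Lloyd's k-means with deterministic farthest-point initialisation
    (a stand-in for whatever clustering algorithm is being assessed)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def pair_agreement(labels_a, labels_b):
    """Fraction of point pairs on which two clusterings agree
    (both put the pair together, or both keep it apart)."""
    n = len(labels_a)
    agree = total = 0
    for i in range(n):
        for j in range(i + 1, n):
            same_a = labels_a[i] == labels_a[j]
            same_b = labels_b[i] == labels_b[j]
            agree += same_a == same_b
            total += 1
    return agree / total


def stability(data, k, target_dim, n_projections, seed=0):
    """Mean pairwise agreement between clusterings of several random
    projections of the same data (a simplified stability index)."""
    rng = random.Random(seed)
    clusterings = [kmeans_labels(random_projection(data, target_dim, rng), k)
                   for _ in range(n_projections)]
    scores = [pair_agreement(a, b)
              for i, a in enumerate(clusterings) for b in clusterings[i + 1:]]
    return sum(scores) / len(scores)


def two_blobs(n_per_blob=15, dim=20, seed=1):
    """Toy data: two well-separated Gaussian blobs in `dim` dimensions."""
    rng = random.Random(seed)
    blob_a = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(n_per_blob)]
    blob_b = [[rng.gauss(10.0, 0.1) for _ in range(dim)] for _ in range(n_per_blob)]
    return blob_a + blob_b
```

Calling `stability(two_blobs(), k, target_dim=5, n_projections=6)` for several values of `k` and comparing the scores gives a rough picture of how a stability-based model order selection might proceed. A value of `k` that matches the true structure should yield clusterings that agree across projections, while the statistical test mentioned in the abstract is not covered by this sketch.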

Highlights

  • Cluster analysis has been widely applied for investigating structure in bio-molecular data

  • We present a stability-based approach to model order selection that uses randomized maps with bounded distortion to perturb the data and stability indices based on the distribution of clustering similarity measures, together with a χ²-based test for assessing the significance of the clustering solutions

  • Using synthetic data, we first show that the proposed methods can detect the "correct" number of clusters and multiple structures underlying the data


Summary

Introduction

Cluster analysis has been widely applied for investigating structure in bio-molecular data. In many cases we do not have sufficient biological knowledge to evaluate a priori either the number of clusters (e.g. the number of biologically distinct tumor classes) or the validity of the discovered clusters (e.g. the reliability of newly discovered tumor classes) [7]. To deal with these problems, several methods for assessing the validity of the discovered clusters and for testing the existence of biologically meaningful clusters have been proposed (see [8] for a review).

