Abstract

In order to discover new subsets (clusters) of a data set, researchers often use algorithms that perform unsupervised clustering, namely, the algorithmic separation of a dataset into some number of distinct clusters. Deciding whether a particular separation (or number of clusters, K) is correct is a sort of ‘dark art’, with multiple techniques available for assessing the validity of unsupervised clustering algorithms. Here, we present a new technique for unsupervised clustering that uses multiple clustering algorithms, multiple validity metrics, and progressively bigger subsets of the data to produce an intuitive 3D map of cluster stability that can help determine the optimal number of clusters in a data set, a technique we call COmbined Mapping of Multiple clUsteriNg ALgorithms (COMMUNAL). COMMUNAL locally optimizes algorithms and validity measures for the data being used. We show its application to simulated data with a known K, and then apply this technique to several well-known cancer gene expression datasets, showing that COMMUNAL provides new insights into clustering behavior and stability in all tested cases. COMMUNAL is shown to be a useful tool for determining K in complex biological datasets, and is freely available as a package for R.

Highlights

  • In order to discover new subsets of a data set, researchers often use algorithms that perform unsupervised clustering, namely, the algorithmic separation of a dataset into some number of distinct clusters

  • A single subset of the data clustered according to a single clustering algorithm, and this is measured by a single cluster validity measure

  • We hypothesized that integrating information from multiple clustering algorithms and multiple validity measures would improve the signal:noise ratio and assist in identifying stable clusters

Read more

Summary

Introduction

In order to discover new subsets (clusters) of a data set, researchers often use algorithms that perform unsupervised clustering, namely, the algorithmic separation of a dataset into some number of distinct clusters. There can be significant disagreement over how many clusters truly exist, or what the ‘right’ number of clusters is; for instance, in glioblastoma multiforme gene expression data, different studies have come to different conclusions for K (from 2–4)[1,2,3] This disagreement arises when different clustering algorithms are used to separate the data, and when different clustering validity metrics are used to judge the ‘right’. For instance, takes multiple subsets of a dataset and uses repeated predictions of cluster assignment to gauge stability[5] Another method, HOPACH, recursively partitions a dataset while seeking to optimize some clustering measure[6]

Methods
Results
Conclusion

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.