Abstract
Background
Cluster analysis is a core task in modern data-centric computation. Algorithmic choice is driven by factors such as data size and heterogeneity, the similarity measures employed, and the type of clusters sought. Familiarity and mere preference often play a significant role as well. Comparisons between clustering algorithms tend to focus on cluster quality. Such comparisons are complicated by the fact that algorithms often have multiple settings that can affect the clusters produced. Such a setting may represent, for example, a preset variable, a parameter of interest, or various sorts of initial assignments. A question of interest then is this: to what degree do the clusters produced vary as setting values change?

Results
This work introduces a new metric, termed simply "robustness", designed to answer that question. Robustness is an easily interpretable measure of the propensity of a clustering algorithm to maintain output coherence over a range of settings. The robustness of eleven popular clustering algorithms is evaluated over some two dozen publicly available mRNA expression microarray datasets. Given their straightforwardness and predictability, hierarchical methods generally exhibited the highest robustness on most datasets. Of the more complex strategies, the paraclique algorithm yielded consistently higher robustness than other algorithms tested, approaching and even surpassing hierarchical methods on several datasets. Other techniques exhibited mixed robustness, with no clear distinction between them.

Conclusions
Robustness provides a simple and intuitive measure of the stability and predictability of a clustering algorithm. It can be a useful tool to aid both in algorithm selection and in deciding how much effort to devote to parameter tuning.
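The abstract leaves the formal definition of robustness to the main text. As an illustrative sketch only, one natural way to quantify "output coherence over a range of settings" is the mean pairwise adjusted Rand index (ARI) between the partitions an algorithm produces at different setting values. The function names and the choice of ARI as the pairwise similarity are assumptions of this sketch, not the article's actual definition:

```python
from collections import Counter
from itertools import combinations
from math import comb

def adjusted_rand_index(a, b):
    """Standard adjusted Rand index between two partitions of the same items."""
    n = len(a)
    sum_ij = sum(comb(c, 2) for c in Counter(zip(a, b)).values())  # contingency cells
    sum_a = sum(comb(c, 2) for c in Counter(a).values())           # row marginals
    sum_b = sum(comb(c, 2) for c in Counter(b).values())           # column marginals
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case: trivial partitions
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

def robustness_score(partitions):
    """Hypothetical robustness proxy: mean pairwise ARI over the partitions
    one algorithm produces at different setting values."""
    pairs = list(combinations(partitions, 2))
    return sum(adjusted_rand_index(p, q) for p, q in pairs) / len(pairs)

# Partitions of six items produced at three hypothetical settings:
# the more the output changes with the settings, the lower the score.
settings_output = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 2, 2],
]
print(robustness_score(settings_output))
```

A score of 1 would mean the algorithm produces identical partitions at every setting; scores fall toward 0 as output varies more with the settings.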
Highlights
Cluster analysis is a core task in modern data-centric computation
To make the scope of this work manageable, and to keep comparisons as equitable as possible, we only consider algorithms that produce non-overlapping clusters, and that are unsupervised, in the sense that classes into which objects are clustered are not defined in advance. (We deviate from this very slightly in the case of Nearest Neighbor Networks (NNN) [18], which allows a pair of clusters to share a single element.) For each method considered we selected a range of settings commonly used in practice.
In a previous comparison of genome-scale clustering algorithms [1], we focused on cluster enrichment, using Jaccard similarity with known Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) annotation sets as a measure of cluster quality.
Summary
Cluster analysis is a core task in modern data-centric computation. Algorithmic choice is driven by factors such as data size and heterogeneity, the similarity measures employed, and the type of clusters sought. Comparisons between clustering algorithms typically focus on the quality of clusters produced, as measured against either a known classification scheme or against some theoretical standard [1,2,3]. In the former case, varying criteria for what constitutes a meritorious cluster are often applied, employing domain-specific knowledge such as ontological enrichment [4, 5], geographical alignment [6] or legacy delineation [7]. In the latter case, statistical quality metrics are most often used, with cluster density something of a gold standard. Additional metrics include the adjusted Rand index [12], homogeneity [13], completeness [14], V-measure [15], and adjusted mutual information.
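As a concrete illustration of the information-theoretic metrics listed above, homogeneity, completeness, and the V-measure can all be derived from entropies of the two labelings. The sketch below is a minimal pure-Python rendering of those standard definitions (the function names are ours, not the article's); scikit-learn's `homogeneity_score`, `completeness_score`, and `v_measure_score` offer production implementations.

```python
from collections import Counter
from math import log

def _entropy(labels):
    """Shannon entropy H(labels) in nats."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def _conditional_entropy(labels, given):
    """H(labels | given), from joint and marginal counts."""
    n = len(labels)
    marg = Counter(given)
    return -sum((c / n) * log((c / n) / (marg[g] / n))
                for (lab, g), c in Counter(zip(labels, given)).items())

def homogeneity(true, pred):
    """1 when every predicted cluster contains only one true class."""
    h = _entropy(true)
    return 1.0 if h == 0 else 1.0 - _conditional_entropy(true, pred) / h

def completeness(true, pred):
    """1 when every true class lands in a single predicted cluster."""
    return homogeneity(pred, true)  # symmetric definition with roles swapped

def v_measure(true, pred):
    """Harmonic mean of homogeneity and completeness."""
    h, c = homogeneity(true, pred), completeness(true, pred)
    return 0.0 if h + c == 0 else 2 * h * c / (h + c)

# Splitting each true class into singletons is perfectly homogeneous
# but only half complete.
print(homogeneity([0, 0, 1, 1], [0, 1, 2, 3]),
      completeness([0, 0, 1, 1], [0, 1, 2, 3]))
```

The example shows why both directions matter: over-fragmenting the data keeps clusters pure (homogeneity 1) while scattering classes across clusters (completeness 0.5), and the V-measure penalizes the imbalance.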