Cross-Study Replicability in Cluster Analysis

Lorenzo Masoero, Svitlana Tyekucheva, Lorenzo Trippa, Giovanni Parmigiani, Emma Thomas

doi:10.1214/22-sts871

Abstract

In cancer research, clustering techniques are widely used for exploratory analyses, playing a critical role in the identification of novel cancer subtypes and in patient management. As the volume of data collected by multiple research groups grows, it is increasingly feasible to investigate the replicability of clustering procedures, that is, their ability to consistently recover biologically meaningful clusters across several data sets. In this paper, we review methods for assessing the replicability of clustering analyses and discuss a novel framework for evaluating cross-study clustering replicability, useful when two or more studies are available. Our approach can be applied to any clustering algorithm and can employ different measures of similarity between partitions to quantify replicability, globally (i.e., for the whole sample) as well as locally (i.e., for individual clusters). Using experiments on synthetic and real gene expression data, we illustrate the usefulness of our procedure for evaluating whether the same clusters are identified consistently across a collection of data sets.
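
To make the phrase "measures of similarity between partitions" concrete, here is a minimal sketch of one such measure, the adjusted Rand index, computed for two labelings of the same samples. The abstract does not name a specific measure, so the choice of the adjusted Rand index and the function name `adjusted_rand_index` are assumptions for illustration only, not the authors' procedure.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same samples.

    Illustrative helper (hypothetical name, not taken from the paper).
    Returns 1.0 when the partitions agree up to relabeling; values near 0
    indicate agreement no better than chance.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same samples")

    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two samples: trivially identical

    # Pairs of samples placed together in both partitions.
    together_both = sum(
        comb(count, 2) for count in Counter(zip(labels_a, labels_b)).values()
    )
    # Pairs placed together within each partition separately.
    together_a = sum(comb(count, 2) for count in Counter(labels_a).values())
    together_b = sum(comb(count, 2) for count in Counter(labels_b).values())

    expected = together_a * together_b / total_pairs
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions are a single cluster
    return (together_both - expected) / (maximum - expected)


# Same grouping with swapped label names: perfect agreement.
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Groupings that cut across each other: agreement below chance.
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

In a cross-study setting, one labeling could come from clustering a study directly and the other from assigning the same samples with a rule learned on a different study; that pairing is likewise an assumption made here, not a description of the framework in the paper.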

Full Text