Head-to-head comparison of clustering methods for heterogeneous data: a simulation-driven benchmark

Gregoire Preud’Homme, Miguel Couceiro, Kevin Dalleau, Olivier Huttin, Malika Smaïl-Tabbone, Emmanuel Bresso, Faiez Zannad, Masatake Kobayashi, Kevin Duarte, Patrick Rossignol, Claire Lacomblez, João Pedro Ferreira, Nicolas Girerd, Marie-Dominique Devignes

doi:10.1038/s41598-021-83340-8

Abstract

The choice of the most appropriate unsupervised machine-learning method for “heterogeneous” or “mixed” data, i.e. with both continuous and categorical variables, can be challenging. Our aim was to examine the performance of various clustering strategies for mixed data using both simulated and real-life data. We conducted a benchmark analysis of “ready-to-use” tools in R comparing 4 model-based (Kamila algorithm, Latent Class Analysis, Latent Class Model [LCM] and Clustering by Mixture Modeling) and 5 distance/dissimilarity-based (Gower distance or Unsupervised Extra Trees dissimilarity followed by hierarchical clustering or Partitioning Around Medoids, K-prototypes) clustering methods. Clustering performance was assessed by the Adjusted Rand Index (ARI) on 1000 generated virtual populations consisting of mixed variables using 7 scenarios with varying population sizes, number of clusters, number of continuous and categorical variables, proportions of relevant (non-noisy) variables and degree of variable relevance (low, mild, high). Clustering methods were then applied to the EPHESUS randomized clinical trial data (a heart failure trial evaluating the effect of eplerenone) to illustrate the differences between clustering techniques. The simulations revealed the dominance of K-prototypes, Kamila and LCM models over all other methods. Overall, methods using dissimilarity matrices in classical algorithms such as Partitioning Around Medoids and Hierarchical Clustering had a lower ARI compared to model-based methods in all scenarios. When applying clustering methods to a real-life clinical dataset, LCM showed promising results with regard to differences in (1) clinical profiles across clusters, (2) prognostic performance (highest C-index) and (3) identification of patient subgroups with substantial treatment benefit.
The present findings suggest key differences in clustering performance between the tested algorithms (limited to tools readily available in R). In most of the tested scenarios, model-based methods (in particular the Kamila and LCM packages) and K-prototypes typically performed best in the setting of heterogeneous data.
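
The abstract scores every method by the Adjusted Rand Index, which compares a predicted partition with the true cluster labels while correcting for chance agreement. The sketch below is a minimal pure-Python version of the standard pair-counting formula. It is not the authors' code (the benchmark itself was run in R), and the function name `adjusted_rand_index` is chosen here only for illustration.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Chance-corrected agreement between two partitions of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("both label sequences must have the same length")

    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: nothing to disagree about

    # Pairs placed together in both partitions (contingency table cells),
    # in the true partition (row sums) and in the predicted one (column sums).
    pairs_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    pairs_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = pairs_true * pairs_pred / total_pairs
    maximum = (pairs_true + pairs_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case: both partitions are trivial and identical
    return (pairs_both - expected) / (maximum - expected)
```

An ARI of 1 means the two partitions are identical up to a relabelling of clusters, values near 0 mean agreement no better than chance, and negative values mean agreement worse than chance.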

Highlights

The choice of the most appropriate unsupervised machine-learning method for “heterogeneous” or “mixed” data, i.e. with both continuous and categorical variables, can be challenging
Clinical research usually relies on heterogeneous data: clinical datasets typically include a mix of variables related to clinical history, general/anthropometric data
In the first group of scenarios, virtual populations with available mixed variables were generated on which a benchmark of clustering techniques was conducted
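
To make the idea of a virtual population of mixed variables concrete, here is a toy Python sketch that draws one continuous and one categorical variable per individual and then computes a Gower dissimilarity matrix, the kind of input used by the distance-based methods. The generating mechanism (Gaussian means shifted by 3 per cluster, a preferred category kept with probability 0.8) and both function names are assumptions made for this illustration. They do not reproduce the paper's seven scenarios or its R tooling.

```python
import random


def simulate_mixed_population(n_per_cluster, n_clusters, seed=0):
    """Toy generator (hypothetical design): one continuous, one categorical variable."""
    rng = random.Random(seed)
    categories = ["a", "b", "c"]
    rows, labels = [], []
    for k in range(n_clusters):
        preferred = categories[k % len(categories)]
        for _ in range(n_per_cluster):
            x = rng.gauss(3.0 * k, 1.0)  # cluster-specific mean, unit variance
            c = preferred if rng.random() < 0.8 else rng.choice(categories)
            rows.append((x, c))
            labels.append(k)
    return rows, labels


def gower_dissimilarity(rows):
    """Gower dissimilarity: range-scaled absolute difference for numeric
    variables, simple mismatch for categorical ones, averaged over variables."""
    n_vars = len(rows[0])
    ranges = []
    for j in range(n_vars):
        column = [row[j] for row in rows]
        if all(isinstance(v, (int, float)) for v in column):
            ranges.append(max(column) - min(column))
        else:
            ranges.append(None)  # marks a categorical variable

    n = len(rows)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(i + 1, n):
            total = 0.0
            for j in range(n_vars):
                a, b = rows[i][j], rows[k][j]
                if ranges[j] is None:
                    total += 0.0 if a == b else 1.0
                elif ranges[j] > 0:
                    total += abs(a - b) / ranges[j]
            dist[i][k] = dist[k][i] = total / n_vars
    return dist


rows, labels = simulate_mixed_population(n_per_cluster=5, n_clusters=3, seed=1)
matrix = gower_dissimilarity(rows)
```

The resulting matrix could then be passed to a hierarchical clustering or Partitioning Around Medoids routine, and the recovered clusters compared with `labels`.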

Journal: Scientific Reports
Publication Date: Feb 18, 2021
Citations: 49
License type: open-access
