A comparison of three clustering methods for finding subgroups in MRI, SMS or clinical data: SPSS TwoStep Cluster analysis, Latent Gold and SNOB.

Peter Kent, Rikke K Jensen, Alice Kongsted

doi:10.1186/1471-2288-14-113

Open Access


Abstract

Background: There are various methodological approaches to identifying clinically important subgroups and one method is to identify clusters of characteristics that differentiate people in cross-sectional and/or longitudinal data using Cluster Analysis (CA) or Latent Class Analysis (LCA). There is a scarcity of head-to-head comparisons that can inform the choice of which clustering method might be suitable for particular clinical datasets and research questions. Therefore, the aim of this study was to perform a head-to-head comparison of three commonly available methods (SPSS TwoStep CA, Latent Gold LCA and SNOB LCA).

Methods: The performance of these three methods was compared: (i) quantitatively using the number of subgroups detected, the classification probability of individuals into subgroups, the reproducibility of results, and (ii) qualitatively using subjective judgments about each program’s ease of use and interpretability of the presentation of results.

We analysed five real datasets of varying complexity in a secondary analysis of data from other research projects. Three datasets contained only MRI findings (n = 2,060 to 20,810 vertebral disc levels), one dataset contained only pain intensity data collected for 52 weeks by text (SMS) messaging (n = 1,121 people), and the last dataset contained a range of clinical variables measured in low back pain patients (n = 543 people). Four artificial datasets (n = 1,000 each) containing subgroups of varying complexity were also analysed, testing the ability of these clustering methods to detect subgroups and correctly classify individuals when subgroup membership was known.

Results: The results from the real clinical datasets indicated that the number of subgroups detected varied, the certainty of classifying individuals into those subgroups varied, the findings had perfect reproducibility, some programs were easier to use and the interpretability of the presentation of their findings also varied. The results from the artificial datasets indicated that all three clustering methods showed a near-perfect ability to detect known subgroups and correctly classify individuals into those subgroups.

Conclusions: Our subjective judgement was that Latent Gold offered the best balance of sensitivity to subgroups, ease of use and presentation of results with these datasets, but we recognise that different clustering methods may suit other types of data and clinical research questions.
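
Two of the quantitative criteria named in the abstract, the classification probability of individuals and correct classification when subgroup membership is known, can be illustrated with a minimal sketch. This is not how TwoStep, Latent Gold or SNOB compute or report these quantities. The function names, the input layout (one row of membership probabilities per individual) and the toy values are assumptions made purely for illustration.

```python
from itertools import permutations


def mean_classification_probability(posteriors):
    """Average, over individuals, of each person's highest membership probability.

    `posteriors` is a hypothetical layout: one list per individual holding
    that individual's probability of belonging to each subgroup.
    """
    return sum(max(row) for row in posteriors) / len(posteriors)


def classification_accuracy(true_labels, assigned_labels):
    """Proportion correctly classified under the best one-to-one label mapping.

    Cluster numbering is arbitrary, so every one-to-one mapping of detected
    clusters onto known subgroups is tried and the best agreement is kept.
    Brute force is only practical for a small number of subgroups.
    """
    true_groups = list(dict.fromkeys(true_labels))
    clusters = list(dict.fromkeys(assigned_labels))
    # Pad with None so that surplus clusters can remain unmatched.
    extra = max(0, len(clusters) - len(true_groups))
    candidates = true_groups + [None] * extra
    best = 0
    for perm in permutations(candidates, len(clusters)):
        mapping = dict(zip(clusters, perm))
        correct = sum(
            mapping[a] == t for a, t in zip(assigned_labels, true_labels)
        )
        best = max(best, correct)
    return best / len(true_labels)


# Toy values, invented for illustration only (not data from the study).
toy_posteriors = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
toy_true = ["A", "A", "B", "B", "B", "A"]
toy_assigned = [1, 1, 0, 0, 1, 1]

print(round(mean_classification_probability(toy_posteriors), 3))
print(round(classification_accuracy(toy_true, toy_assigned), 3))
```

Applying the same label-matching step to the assignments from two repeated runs of one program would give a simple, assumed measure of reproducibility.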

Highlights

There are various methodological approaches to identifying clinically important subgroups and one method is to identify clusters of characteristics that differentiate people in cross-sectional and/or longitudinal data using Cluster Analysis (CA) or Latent Class Analysis (LCA)
This study investigated the use of three clustering methods, each implemented within a separate software program: (i) TwoStep Cluster Analysis in IBM SPSS, which is available in the base package of this program (TwoStep) [16], (ii) Latent Class Modeling in Latent Gold, which is the simplest of three LCA approaches available in this program (Latent Gold) [17], and (iii) ‘vanilla’ SNOB, which is the most straightforward form of this program (SNOB) [18,19,20]
The differences in the number of subgroups detected were typically smaller between Latent Gold and SNOB than between either of these and TwoStep; the Short Message Service (SMS) dataset was an exception to this observation

Summary

Introduction

There are various methodological approaches to identifying clinically important subgroups and one method is to identify clusters of characteristics that differentiate people in cross-sectional and/or longitudinal data using Cluster Analysis (CA) or Latent Class Analysis (LCA). There is increasing interest in the identification of clinically important patient subgroups in order to better target treatment, make more accurate estimates of prognosis, and improve health system efficiency by providing the right treatment to the right patient at the right time [1,2]. This is especially so in non-specific health conditions that are highly prevalent, costly and have a high burden of disease. Other statistical methods seek to identify clusters of symptoms and signs that differentiate people, in cross-sectional and/or longitudinal data. This approach was taken by Beneciuk et al. [11], who used cluster analysis of baseline fear avoidance data from patients in a clinical trial and found three distinct subgroups (low risk, high specific fear, and high fear and catastrophising) that were associated with different clinical trajectories.

Objectives

Methods

Results

Conclusion