Flexible parametric bootstrap for testing homogeneity against clustering and assessing the number of clusters

Christian Hennig, Chien-Ju Lin

doi:10.1007/s11222-015-9566-5

Open Access

https://doi.org/10.1007/s11222-015-9566-5

Abstract

Two notoriously hard problems in cluster analysis are estimating the number of clusters and checking whether the population to be clustered is not actually homogeneous. Given a dataset, a clustering method and a cluster validation index, this paper proposes to set up null models that capture structural features of the data that cannot be interpreted as indicating clustering. Artificial datasets are sampled from the null model with parameters estimated from the original dataset. This can be used for testing the null hypothesis of a homogeneous population against a clustering alternative. It can also be used to calibrate the validation index for estimating the number of clusters, by taking into account the expected distribution of the index under the null model for any given number of clusters. The approach is illustrated by three examples, involving various clustering techniques (partitioning around medoids, hierarchical methods, a Gaussian mixture model), validation indexes (average silhouette width, prediction strength and BIC), and issues such as mixed type data, temporal and spatial autocorrelation.
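The testing idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes one-dimensional data, a plain Gaussian null model, a toy 2-means routine standing in for the clustering method, and the average silhouette width as the validation index. All function names (`two_means_1d`, `avg_silhouette_width`, `homogeneity_test`) and the simulated data are hypothetical and chosen only for illustration.

```python
import random
import statistics


def two_means_1d(xs, iters=100):
    """Toy 1-D k-means with k = 2; returns a 0/1 label per point."""
    c0, c1 = min(xs), max(xs)
    labels = [0 if abs(x - c0) <= abs(x - c1) else 1 for x in xs]
    for _ in range(iters):
        g0 = [x for x, lab in zip(xs, labels) if lab == 0]
        g1 = [x for x, lab in zip(xs, labels) if lab == 1]
        if not g0 or not g1:  # degenerate case: all points identical
            break
        c0, c1 = statistics.fmean(g0), statistics.fmean(g1)
        new = [0 if abs(x - c0) <= abs(x - c1) else 1 for x in xs]
        if new == labels:  # assignments stable, so the algorithm has converged
            break
        labels = new
    return labels


def avg_silhouette_width(xs, labels):
    """Average silhouette width for a two-cluster labelling."""
    n = len(xs)
    widths = []
    for i in range(n):
        own = [abs(xs[i] - xs[j]) for j in range(n)
               if j != i and labels[j] == labels[i]]
        other = [abs(xs[i] - xs[j]) for j in range(n) if labels[j] != labels[i]]
        if not own or not other:  # singleton or single cluster: width 0
            widths.append(0.0)
            continue
        a, b = statistics.fmean(own), statistics.fmean(other)
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return statistics.fmean(widths)


def homogeneity_test(xs, m=99, seed=0):
    """Parametric bootstrap test of a Gaussian null against clustering."""
    rng = random.Random(seed)
    # Estimate the null model parameters from the original data.
    mu, sigma = statistics.fmean(xs), statistics.stdev(xs)
    observed = avg_silhouette_width(xs, two_means_1d(xs))
    null = []
    for _ in range(m):
        # Sample an artificial dataset of the same size from the fitted null.
        sim = [rng.gauss(mu, sigma) for _ in xs]
        null.append(avg_silhouette_width(sim, two_means_1d(sim)))
    # Large index values point towards clustering, so the test is one-sided.
    p = (1 + sum(v >= observed for v in null)) / (m + 1)
    return observed, null, p


# Illustrative data only: two well-separated simulated groups.
rng = random.Random(42)
data = [rng.gauss(-5, 0.5) for _ in range(20)] + [rng.gauss(5, 0.5) for _ in range(20)]
observed, null, p_value = homogeneity_test(data)
print(f"ASW = {observed:.3f}, bootstrap p-value = {p_value:.3f}")
```

In a real application the Gaussian null would be replaced by a null model that reproduces the non-clustering structure of the data at hand (for example autocorrelation), and the toy clustering routine by the method actually under study.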

Highlights

Cluster analysis is about finding groups of objects in data
The real dataset achieves this for Partitioning Around Medoids (PAM) and k = 2 only, but this does not seem to be the best value compared to the null model, and prediction strength (PS) > 0.8 can be achieved by the null model for k = 2 for all clustering methods, and for PAM occasionally even for k = 3
Overall, taking into account the spatial autocorrelation of the unprocessed presence–absence data in this way changes the results quite a bit, compared to Sect. 5.3, and can explain the extent to which clustering is observed through the Bayesian information criterion (BIC)

Summary

Introduction

Cluster analysis is about finding groups of objects in data. It is a key area of data analysis, with applications virtually wherever data arise. The validation index can be used as a test statistic for testing homogeneity against a clustering alternative (this yields a test for each candidate k for which the index is computed, and these tests need to be aggregated to a single homogeneity test), and the simulated null distribution can be used to calibrate the validity index by comparing its value on the dataset against what is expected under the null model. We argue that this is a better foundation for a decision about the number of clusters than the heuristics behind the standard recommendations in most of the literature. We propose using the parametric bootstrap to sample from null models that capture the non-clustering structure in the data for testing homogeneity against clustering, and for calibrating validity indexes.
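The calibration and aggregation steps described above can be sketched as follows. This is one possible reading, not necessarily the procedure used in the paper: the index value for each k is standardised by the mean and standard deviation of its bootstrap null distribution, and the per-k tests are aggregated by taking the maximum standardised value over k and comparing it with the same maximum on each bootstrap dataset. The function name `calibrate` and every number below are made up for illustration and are not results from the paper.

```python
import random
import statistics


def calibrate(observed, null_runs):
    """Calibrate a validation index against its bootstrap null distribution.

    observed:  {k: index value on the real data}
    null_runs: list of {k: index value}, one dict per bootstrap dataset
    Returns ({k: standardised index}, aggregated p-value).
    Assumes the null standard deviation is positive for every k.
    """
    ks = sorted(observed)
    mean = {k: statistics.fmean([run[k] for run in null_runs]) for k in ks}
    sd = {k: statistics.stdev([run[k] for run in null_runs]) for k in ks}

    def standardise(values):
        return {k: (values[k] - mean[k]) / sd[k] for k in ks}

    z_obs = standardise(observed)
    # Aggregate over k with the maximum standardised value (an assumption).
    t_obs = max(z_obs.values())
    t_null = [max(standardise(run).values()) for run in null_runs]
    p = (1 + sum(t >= t_obs for t in t_null)) / (len(null_runs) + 1)
    return z_obs, p


# Hypothetical inputs: under this made-up null the index shrinks as k grows.
rng = random.Random(1)
null_runs = [{k: rng.gauss(0.6 - 0.1 * k, 0.02) for k in (2, 3, 4)}
             for _ in range(99)]
observed = {2: 0.45, 3: 0.43, 4: 0.25}

z_obs, p_value = calibrate(observed, null_runs)
raw_best_k = max(observed, key=observed.get)   # largest raw index value
calibrated_best_k = max(z_obs, key=z_obs.get)  # largest value relative to the null
print(raw_best_k, calibrated_best_k, p_value)
```

The toy numbers are chosen so that the raw index is largest for k = 2 while the calibrated index favours k = 3, which is the kind of discrepancy that comparing against the null model is meant to expose.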

The general setup

Clustering method and validation index

Non-clustering structure

Null model

Null model parameter estimation

Parametric bootstrap

Results

Results with plain Gaussian null model

Null model for spatial autocorrelation

Parametric bootstrap (repeat m times)

Concluding discussion


Journal: Statistics and Computing
Publication Date: Jun 11, 2015
Citations: 27
License type: CC BY 4.0
