Abstract

There are two notoriously hard problems in cluster analysis: estimating the number of clusters, and checking whether the population to be clustered is not actually homogeneous. Given a dataset, a clustering method and a cluster validation index, this paper proposes to set up null models that capture structural features of the data that cannot be interpreted as indicating clustering. Artificial datasets are sampled from the null model with parameters estimated from the original dataset. This can be used for testing the null hypothesis of a homogeneous population against a clustering alternative. It can also be used to calibrate the validation index for estimating the number of clusters, by taking into account the expected distribution of the index under the null model for any given number of clusters. The approach is illustrated by three examples, involving different clustering techniques (partitioning around medoids, hierarchical methods, a Gaussian mixture model), validation indexes (average silhouette width, prediction strength and BIC), and issues such as mixed-type data and temporal and spatial autocorrelation.

Highlights

  • Cluster analysis is about finding groups of objects in data

  • The real dataset achieves this for Partitioning Around Medoids (PAM) and k = 2 only, but this does not seem to be the best value compared to the null model, and prediction strength (PS) > 0.8 can be achieved by the null model for k = 2 for all clustering methods, and for PAM occasionally even for k = 3

  • Overall, taking into account the spatial autocorrelation of the unprocessed presence–absence data in this way changes the results quite a bit, compared to Sect. 5.3, and can explain the extent to which clustering is observed through the Bayesian information criterion (BIC)


Summary

Introduction

Cluster analysis is about finding groups of objects in data. It is a key area of data analysis, with applications wherever data arise. The validation index can be used as a test statistic for testing homogeneity against a clustering alternative (this yields one test for each candidate k for which the index is computed, and these tests need to be aggregated into a single homogeneity test), and the simulated null distribution can be used to calibrate the validity index by comparing its value on the dataset against what is expected under the null model. We argue that this is a better foundation for a decision about the number of clusters than the heuristics behind the standard recommendations in most of the literature. We propose using the parametric bootstrap to sample from null models that capture the non-clustering structure in the data for testing homogeneity against clustering, and for calibrating validity indexes.
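
The idea can be illustrated with a minimal sketch. It is not the paper's implementation, and it rests on several assumptions made only for illustration: one-dimensional data, a plain Gaussian as the null model, a simple k-means routine as the clustering method, and the average silhouette width as the validation index. The function names (kmeans_1d, avg_silhouette, bootstrap_calibration) are hypothetical, and standardising the index by the null mean and standard deviation is just one possible way to calibrate it. The aggregation of the per-k tests into a single homogeneity test is not shown.

```python
import random
import statistics


def kmeans_1d(x, k, iters=50):
    """Lloyd's algorithm on 1-D data, initialised at evenly spaced quantiles."""
    xs = sorted(x)
    n = len(xs)
    centers = [xs[int((2 * j + 1) * n / (2 * k))] for j in range(k)]
    labels = [0] * n
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: abs(v - centers[j])) for v in x]
        new = []
        for j in range(k):
            members = [v for v, lab in zip(x, labels) if lab == j]
            new.append(statistics.fmean(members) if members else centers[j])
        if new == centers:
            break
        centers = new
    return labels


def avg_silhouette(x, labels):
    """Average silhouette width with absolute difference as dissimilarity."""
    k = max(labels) + 1
    clusters = [[v for v, lab in zip(x, labels) if lab == j] for j in range(k)]
    total = 0.0
    for v, lab in zip(x, labels):
        own = clusters[lab]
        others = [c for j, c in enumerate(clusters) if j != lab and c]
        if len(own) == 1 or not others:
            continue  # such points contribute a silhouette of 0
        a = sum(abs(v - u) for u in own) / (len(own) - 1)
        b = min(sum(abs(v - u) for u in c) / len(c) for c in others)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(x)


def bootstrap_calibration(x, ks=(2, 3), m=50, seed=0):
    """Parametric bootstrap from a Gaussian null fitted to x (an assumption)."""
    rng = random.Random(seed)
    # Estimate the null model parameters from the original data.
    mu, sigma = statistics.fmean(x), statistics.stdev(x)
    observed = {k: avg_silhouette(x, kmeans_1d(x, k)) for k in ks}
    null = {k: [] for k in ks}
    for _ in range(m):
        # Sample an artificial dataset of the same size from the null model.
        sim = [rng.gauss(mu, sigma) for _ in x]
        for k in ks:
            null[k].append(avg_silhouette(sim, kmeans_1d(sim, k)))
    out = {}
    for k in ks:
        exceed = sum(s >= observed[k] for s in null[k])
        out[k] = {
            "observed": observed[k],
            # One-sided Monte Carlo p-value for this k.
            "p_value": (1 + exceed) / (m + 1),
            # Illustrative calibration: standardise by the null distribution.
            "calibrated": (observed[k] - statistics.fmean(null[k]))
            / statistics.stdev(null[k]),
        }
    return out


# Toy data with two well separated groups (made up for this example).
rng = random.Random(1)
data = [rng.gauss(0, 1) for _ in range(20)] + [rng.gauss(8, 1) for _ in range(20)]
result = bootstrap_calibration(data)
for k, r in result.items():
    print(k, round(r["observed"], 3), round(r["p_value"], 3), round(r["calibrated"], 2))
```

In this sketch, a small p-value for some k suggests that the index value is larger than a homogeneous Gaussian population would typically produce, and the calibrated values can be compared across k instead of the raw index values. For data with structure such as autocorrelation, the Gaussian sampler would have to be replaced by a null model that reproduces that structure.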

The general setup
Clustering method and validation index
Non-clustering structure
Null model
Null model parameter estimation
Parametric bootstrap
Results
Results with plain Gaussian null model
Null model for spatial autocorrelation
Parametric bootstrap Repeat m times
Concluding discussion