Abstract

In the framework of Bayesian model-based clustering based on a finite mixture of Gaussian distributions, we present a joint approach to estimate the number of mixture components and identify cluster-relevant variables simultaneously as well as to obtain an identified model. Our approach consists in specifying sparse hierarchical priors on the mixture weights and component means. In a deliberately overfitting mixture model the sparse prior on the weights empties superfluous components during MCMC. A straightforward estimator for the true number of components is given by the most frequent number of non-empty components visited during MCMC sampling. Specifying a shrinkage prior, namely the normal gamma prior, on the component means leads to improved parameter estimates as well as identification of cluster-relevant variables. After estimating the mixture model using MCMC methods based on data augmentation and Gibbs sampling, an identified model is obtained by relabeling the MCMC output in the point process representation of the draws. This is performed using K-centroids cluster analysis based on the Mahalanobis distance. We evaluate our proposed strategy in a simulation setup with artificial data and by applying it to benchmark data sets.

Highlights

  • Finite mixture models provide a well-known probabilistic approach to model-based clustering

  • The main contribution of the present paper is to propose the use of sparse finite mixture models as an alternative to infinite mixtures in the context of model-based clustering

  • In the framework of Bayesian model-based clustering, we propose a straightforward and simple strategy for simultaneous estimation of the unknown number of mixture components, component-specific parameters, classification of observations, and identification of cluster-relevant variables for multivariate Gaussian mixtures

Read more

Summary

Introduction

Finite mixture models provide a well-known probabilistic approach to model-based clustering. Kim et al (2006), for instance, combine stochastic search for cluster-relevant variables with a Dirichlet process prior on the mixture weights to estimate the number of components. This proposal leads to a simple Bayesian framework where a straightforward MCMC sampling procedure is applied to jointly estimate the unknown number of mixture components, to determine cluster-relevant variables, and to perform componentspecific inference. To estimate the number of mixture components, we derive a point estimator from this distribution, typically, the posterior mode which is equal to the most frequent number of non-empty components visited during MCMC sampling This approach constitutes a simple and automatic strategy to estimate the unknown number of mixture components, without making use of model selection criteria, RJMCMC, or marginal likelihoods.

Model specification
Identifying the number of mixture components
Identifying cluster-relevant variables
Prior on the variance-covariance matrices
Bayesian estimation
MCMC sampling
On the relation between shrinkage in the weights and in the component means
Identifying sparse finite mixtures
Clustering the MCMC output in the point process representation
K -centroids clustering based on the Mahalanobis distance
Simulation study
Simulation setup with equal weights
Simulation setup with unequal weights
Crabs data
Iris data
Discussion
Findings
Sample hyperparameters:
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call