Abstract

In model-based clustering, the Galaxy data set is often used as a benchmark to study the performance of different modeling approaches. Aitkin (Stat Model 1:287–304) compares maximum likelihood and Bayesian analyses of the Galaxy data set and expresses reservations about the Bayesian approach because the prior assumptions imposed remain rather obscure while playing a major role in the results obtained and the conclusions drawn. This paper addresses Aitkin’s concerns by shedding light on how the specified priors influence the estimated number of clusters. We perform a sensitivity analysis of different prior specifications for the mixture of finite mixtures model, i.e., the mixture model that includes a prior on the number of components. We use an extensive set of prior specifications in a full factorial design and assess their impact on the estimated number of clusters for the Galaxy data set. The results highlight interaction effects between the prior specifications and indicate which prior specifications are recommended to obtain a sparse clustering solution. A simulation study with artificial data provides further empirical evidence to support the recommendations. A clear understanding of the impact of the prior specifications removes a barrier to using Bayesian methods, namely the complexity of selecting suitable priors. In addition, the regularizing properties of the priors may be exploited intentionally to obtain a clustering solution that meets prior expectations and the needs of the application.

Highlights

  • This paper investigates the impact of different prior specifications on the results obtained in Bayesian cluster analysis based on mixture models

  • Aitkin (2001) compares maximum likelihood and Bayesian analyses of mixture models and expresses reservations about the Bayesian approach because the prior assumptions imposed remain rather obscure while playing a major role in the results obtained and the conclusions drawn

  • We investigate the prior on K+ induced by the prior specifications on K and γK considered for the Galaxy data set to further gauge our prior expectations of the influence of these prior specifications on the cluster solutions obtained
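The induced prior on K+ mentioned in the last highlight can be approximated by simulation. The following is a minimal sketch, not the authors' implementation. It reads K+ as the number of components that actually receive observations, which is an interpretation because this excerpt does not define the symbol. The uniform prior on K over 1 to 30, the two choices for γK, the sample size of 82 (the commonly cited size of the Galaxy data set), and all function names are illustrative assumptions.

```python
import numpy as np


def simulate_k_plus(n_obs, sample_k, gamma_of_k, n_draws=2000, seed=0):
    """Monte Carlo draws of K+ (number of non-empty components) under the prior.

    sample_k and gamma_of_k are hypothetical callables supplied by the user:
    sample_k(rng) returns one draw of K, gamma_of_k(k) returns the symmetric
    Dirichlet parameter used for the weights when there are k components.
    """
    rng = np.random.default_rng(seed)
    k_plus = np.empty(n_draws, dtype=int)
    for i in range(n_draws):
        k = int(sample_k(rng))
        if k == 1:
            # A single component necessarily holds all observations.
            k_plus[i] = 1
            continue
        # Component weights from a symmetric Dirichlet prior.
        weights = rng.dirichlet(np.full(k, gamma_of_k(k)))
        # Allocate the observations to components given the weights.
        counts = rng.multinomial(n_obs, weights)
        # K+ counts only the components that received at least one observation.
        k_plus[i] = np.count_nonzero(counts)
    return k_plus


def uniform_k(rng):
    # Assumed prior: K uniform on {1, ..., 30}.
    return rng.integers(1, 31)


n_obs = 82  # assumed sample size for the Galaxy data set

# Assumed "static" choice: the Dirichlet parameter does not depend on K.
k_plus_static = simulate_k_plus(n_obs, uniform_k, lambda k: 1.0)
# Assumed "dynamic" choice: the Dirichlet parameter shrinks as K grows.
k_plus_dynamic = simulate_k_plus(n_obs, uniform_k, lambda k: 1.0 / k)

print("mean K+ (static):", k_plus_static.mean())
print("mean K+ (dynamic):", k_plus_dynamic.mean())
```

Comparing the two simulated distributions, for example with `np.bincount`, shows how the same prior on K can imply very different prior beliefs about the number of filled components once the prior on the weights changes.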


Summary

Introduction

This paper investigates the impact of different prior specifications on the results obtained in Bayesian cluster analysis based on mixture models. Mixture models may be used either to approximate arbitrary densities in a semi-parametric way or, in a model-based clustering context, to identify groups in the data. We focus on the latter application, where each component is assumed to potentially represent a data cluster and the cluster distribution is not approximated by several mixture components. Hennig and Liao (2013) claim that “there are no unique ‘true’ or ‘best’ clusters in a data set” but that the prototypical shape of a cluster needs to be specified before this question can be answered. For clustering methods using mixture models, the prototypical shape of a cluster is in general specified by selecting the component-specific distributions. For the fitted mixture model, a one-to-one relationship between components and clusters is assumed. In the case of multivariate metric data, one can specify isotropic Gaussian distributions as component distributions, where the variance is comparable across components, or Gaussian distributions with arbitrary variance-covariance matrices, which are allowed to vary considerably across components (see, for example, Fraley and Raftery 2002).

