Abstract

Mixture models are a natural choice in many applications, but it can be difficult to place an a priori upper bound on the number of components. To circumvent this, investigators are turning increasingly to Dirichlet process mixture models (DPMMs). It is therefore important to develop an understanding of the strengths and weaknesses of this approach. This work considers the MAP (maximum a posteriori) clustering for the Gaussian DPMM (where the cluster means have Gaussian distribution and, for each cluster, the observations within the cluster have Gaussian distribution). Some desirable properties of the MAP partition are proved: ‘almost disjointness’ of the convex hulls of clusters (they may have at most one point in common) and, under natural assumptions, that the sizes of those clusters that intersect any fixed ball are comparable with the total number of observations (as the latter goes to infinity). Consequently, the number of such clusters remains bounded. Furthermore, if the data arises from independent identically distributed sampling from a given distribution with bounded support, then the asymptotic MAP partition of the observation space maximises a function with a straightforward expression that depends only on the within-group covariance parameter. As the operator norm of this covariance parameter decreases, the number of clusters in the MAP partition becomes arbitrarily large, which may lead to overestimation of the number of mixture components.
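
To make the generative model concrete, here is a minimal sketch (not from the article) of sampling from a Gaussian DPMM via the Chinese restaurant process: cluster means are drawn from a Gaussian prior and each observation is Gaussian around its cluster mean. The concentration parameter `alpha`, the mean scale, and `sigma` are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gaussian_dpmm(n, alpha=1.0, dim=2, mean_scale=5.0, sigma=0.5):
    """Draw n points from a Gaussian DPMM via the Chinese restaurant process:
    cluster means ~ N(0, mean_scale^2 I), observations ~ N(mean, sigma^2 I)."""
    labels = np.empty(n, dtype=int)
    means = []            # one mean per cluster created so far
    counts = []           # current cluster sizes
    for i in range(n):
        # Seat the i-th observation: existing cluster k w.p. counts[k]/(i+alpha),
        # a new cluster w.p. alpha/(i+alpha).
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)
        if k == len(means):                       # open a new cluster
            means.append(rng.normal(0.0, mean_scale, size=dim))
            counts.append(0)
        counts[k] += 1
        labels[i] = k
    means = np.asarray(means)
    X = means[labels] + rng.normal(0.0, sigma, size=(n, dim))
    return X, labels

X, z = sample_gaussian_dpmm(500)
print(f"{z.max() + 1} clusters generated")
```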

Highlights

  • Clustering is a central task in statistical data analysis

  • When there is no natural a priori upper bound on the number of clusters, an increasingly popular approach is to use Dirichlet process mixture models (DPMMs)

  • This section presents definitions of the fundamental notions of our considerations together with some of their basic properties and relevant formulas. We show how they can be used to construct a statistical model in which we expect the data to be generated from different sources of randomness, without an a priori upper bound on the number of these sources


Summary

Motivation and new contributions

Clustering is a central task in statistical data analysis. A Bayesian approach is to model the data as coming from a random mixture of distributions and to derive the posterior distribution on the space of possible partitions into clusters. Dahl (2006) suggests choosing the MAP estimator from a sample from the posterior. He notes a potential problem with this approach: there may be only a small difference in posterior probability between two significantly different partitions. This may indicate that the classifier is giving the wrong answer as a consequence of mis-specification of the within-cluster covariance parameter. If the data is i.i.d. from an input distribution which is uniform over a ball of radius r in R² and the within-cluster covariance parameter is σ²I, then, for small σ, the classifier partitions the ball into several, seemingly arbitrary, convex sets.
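
The following sketch illustrates this phenomenon on data drawn uniformly from the unit disc. It uses a collapsed Gibbs sampler for a Gaussian DPMM with fixed within-cluster covariance σ²I and a Gaussian prior on the cluster means; this is only an MCMC stand-in for the exact MAP partition analysed in the article, and the values of `alpha`, `tau`, and the grid of `sigma` values are illustrative assumptions. The point is that as `sigma` shrinks, the number of occupied clusters grows.

```python
import numpy as np

rng = np.random.default_rng(1)

def crp_gibbs(X, alpha=1.0, sigma=0.1, tau=10.0, n_iter=30):
    """Collapsed Gibbs sampler for a Gaussian DPMM: CRP(alpha) prior on
    partitions, cluster means ~ N(0, tau^2 I), observations ~ N(mean, sigma^2 I).
    Returns the labels after the final sweep (a crude proxy for the MAP partition)."""
    n, d = X.shape
    z = np.zeros(n, dtype=int)                  # start with one big cluster
    for _ in range(n_iter):
        for i in range(n):
            z[i] = -1                           # remove point i from its cluster
            labels, counts = np.unique(z[z >= 0], return_counts=True)
            logp = []
            for k, nk in zip(labels, counts):
                members = X[z == k]
                # posterior over the cluster mean given its current members
                post_var = 1.0 / (1.0 / tau**2 + nk / sigma**2)
                post_mean = post_var * members.sum(axis=0) / sigma**2
                pred_var = post_var + sigma**2
                resid = X[i] - post_mean
                logp.append(np.log(nk)
                            - 0.5 * d * np.log(2 * np.pi * pred_var)
                            - 0.5 * resid @ resid / pred_var)
            # weight for opening a new cluster
            pred_var = tau**2 + sigma**2
            logp.append(np.log(alpha)
                        - 0.5 * d * np.log(2 * np.pi * pred_var)
                        - 0.5 * X[i] @ X[i] / pred_var)
            logp = np.array(logp)
            probs = np.exp(logp - logp.max())
            probs /= probs.sum()
            choice = rng.choice(len(probs), p=probs)
            z[i] = z.max() + 1 if choice == len(labels) else labels[choice]
    return z

# Data: i.i.d. uniform on the unit disc in R^2.
theta = rng.uniform(0, 2 * np.pi, 300)
radius = np.sqrt(rng.uniform(0, 1, 300))
X = np.column_stack([radius * np.cos(theta), radius * np.sin(theta)])

for sigma in (0.5, 0.05):
    z = crp_gibbs(X, sigma=sigma)
    print(f"sigma = {sigma}: {len(np.unique(z))} clusters")
```

With the larger `sigma` the sampler typically keeps the disc in a handful of clusters, while the smaller `sigma` fragments it into many convex-looking pieces, matching the over-clustering behaviour described above.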

Organisation of the article
The model
Results
Examples
Uniform distribution on an interval
Exponential distribution
Mixture of two normals
Uniform distribution on a disc
The MAP clustering properties
Classification of randomly generated data
The induced partition
Convergence of the MAP partitions
Discussion