Abstract

Model-based clustering is a popular approach for clustering multivariate data and has been applied in numerous fields. High-dimensional data are increasingly common, and the model-based clustering approach has adapted to deal with the increasing dimensionality. In particular, the development of variable selection techniques has received considerable attention and research effort in recent years. Even for small problems, variable selection has been advocated to facilitate the interpretation of the clustering results. This review provides a summary of the methods developed for variable selection in model-based clustering. Existing R packages implementing the different methods are indicated and illustrated in application to two data analysis examples.

Highlights

  • Model-based clustering is a well-established and popular tool for clustering multivariate data

  • Note that the above objective function is related to the log-likelihood of a Gaussian mixture model with a common isotropic covariance matrix of the form $\Sigma_g = \sigma^2 I$, an implicit assumption of the method. In this exposition we considered only the case associated with Gaussian mixture modeling, and we adapted equation (10) of Witten and Tibshirani (2010) for this purpose, using the connection between k-means and the fitting of a mixture of Gaussians with a common spherical covariance matrix

  • After the description of the main variable selection methods for Gaussian mixture models (GMMs), we present the different approaches for variable selection in latent class analysis
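
The connection between k-means and Gaussian mixtures with a common spherical covariance matrix, mentioned in the highlights above, can be checked numerically. The following is a minimal sketch under stated assumptions, not code from the review: the function names, the toy data, and the choice of hard assignments with fixed means and equal mixing proportions are all illustrative. With the covariance fixed at sigma2 * I, the log-likelihood of a hard partition equals a constant minus the within-cluster sum of squares divided by 2 * sigma2, so the partition with the smaller k-means objective has the larger log-likelihood.

```python
import numpy as np

def within_cluster_ss(X, labels, means):
    """k-means objective: total squared distance of each point to its assigned mean."""
    diffs = X - means[labels]
    return float(np.sum(diffs ** 2))

def classification_loglik(X, labels, means, sigma2):
    """Log-likelihood of hard-assigned points under N(mean_g, sigma2 * I).

    Assumption: equal mixing proportions, so their (constant) term is omitted.
    """
    n, p = X.shape
    wss = within_cluster_ss(X, labels, means)
    return float(-0.5 * n * p * np.log(2.0 * np.pi * sigma2) - wss / (2.0 * sigma2))

# Hypothetical toy data: two well-separated pairs of points in two dimensions.
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
means = np.array([[0.0, 0.5], [5.0, 5.5]])  # held fixed for both partitions
good = np.array([0, 0, 1, 1])               # points grouped with their nearest mean
bad = np.array([0, 1, 0, 1])                # two points assigned to the far mean
sigma2 = 1.0

wss_good = within_cluster_ss(X, good, means)
wss_bad = within_cluster_ss(X, bad, means)
ll_good = classification_loglik(X, good, means, sigma2)
ll_bad = classification_loglik(X, bad, means, sigma2)

# The log-likelihood gap is exactly the sum-of-squares gap scaled by 1 / (2 * sigma2).
print(wss_good, wss_bad, ll_good - ll_bad)
```

Because the normalizing term depends only on n, p, and sigma2, it cancels when two partitions are compared, which is why the common isotropic covariance is described as an implicit assumption of the k-means-type objective.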


Summary

Introduction

Model-based clustering is a well-established and popular tool for clustering multivariate data. In this approach, clustering is formulated in a modeling framework and the data generating process is represented through a finite mixture of probability distributions. Some variables may not possess any clustering information and are of no use in the detection of the group structure. Rather, they could be detrimental to the clustering. Even in situations of moderate or low dimensionality, reducing the set of variables employed in the clustering process can be beneficial (Fowlkes, Gnanadesikan and Kettenring, 1988).
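
For concreteness, the finite mixture representation described above can be written in generic notation. The symbols here are assumptions chosen for illustration and are not taken from the text: G is the number of components, \tau_g are the mixing proportions, and f_g is the density of component g with parameters \theta_g.

```latex
f(\mathbf{x}_i \mid \Theta) = \sum_{g=1}^{G} \tau_g \, f_g(\mathbf{x}_i \mid \theta_g),
\qquad \tau_g > 0, \quad \sum_{g=1}^{G} \tau_g = 1 .
```

In this notation, a variable carries no clustering information when its distribution does not differ across the G components, which is the situation variable selection methods aim to detect.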

