Abstract
Cluster analysis is an important unsupervised learning technique in multivariate statistics and machine learning. In this paper, we propose a set of new mixture models called CLEMM (short for Clustering with Envelope Mixture Models) that is based on the widely used Gaussian mixture model assumptions and the nascent research area of envelope methodology. Formulated mostly for regression models, envelope methodology aims for simultaneous dimension reduction and efficient parameter estimation, and includes a very recent formulation of the envelope discriminant subspace for classification and discriminant analysis. Motivated by the envelope discriminant subspace pursuit in classification, we consider parsimonious probabilistic mixture models in which cluster analysis can be improved by projecting the data onto a latent lower-dimensional subspace. The proposed CLEMM framework and the associated envelope-EM algorithms thus provide foundations for envelope methods in unsupervised and semi-supervised learning problems. Numerical studies on simulated data and two benchmark data sets show significant improvement of our proposed methods over classical methods such as Gaussian mixture models, K-means, and hierarchical clustering algorithms. An R package is available at https://github.com/kusakehan/CLEMM.
Highlights
Cluster analysis is a cornerstone of multivariate statistics and unsupervised learning
We focus on Gaussian mixture models (GMMs) because of their popularity and effectiveness in approximating multimodal distributions
To make our envelope-EM algorithm easier to comprehend, we present it in a way that is parallel to the classical EM algorithm for fitting GMM
Summary
Cluster analysis (or clustering) is a cornerstone of multivariate statistics and unsupervised learning. Our method is built on the envelope principle, which is more flexible and general: it assumes that there exists a subspace such that observations projected onto this subspace share a common structure that is invariant as the underlying cluster varies. The envelope, in contrast to generic dimension reduction, is a more targeted dimension-reduction subspace whose goal is to improve efficiency in Gaussian mixture model parameter estimation and thereby obtain better clustering results. Under this shared covariance assumption, the envelope-EM algorithm can be even faster than the standard EM estimation in Gaussian mixture models, which does not require subspace estimation.