Abstract

Clustering analysis is an important unsupervised learning technique in multivariate statistics and machine learning. In this paper, we propose a set of new mixture models called CLEMM (short for Clustering with Envelope Mixture Models) that is based on the widely used Gaussian mixture model assumptions and the nascent research area of envelope methodology. Formulated mostly for regression models, envelope methodology aims at simultaneous dimension reduction and efficient parameter estimation, and includes a very recent formulation of the envelope discriminant subspace for classification and discriminant analysis. Motivated by the envelope discriminant subspace pursuit in classification, we consider parsimonious probabilistic mixture models in which cluster analysis can be improved by projecting the data onto a latent lower-dimensional subspace. The proposed CLEMM framework and the associated envelope-EM algorithms thus provide foundations for envelope methods in unsupervised and semi-supervised learning problems. Numerical studies on simulated data and two benchmark data sets show significant improvement of our proposed methods over classical methods such as Gaussian mixture models, K-means, and hierarchical clustering algorithms. An R package is available at https://github.com/kusakehan/CLEMM.

Highlights

  • Cluster analysis is a cornerstone of multivariate statistics and unsupervised learning

  • We focus on the Gaussian mixture models (GMM) because of their popularity and effectiveness in approximating multimodal distributions

  • To make our envelope-EM algorithm easier to comprehend, we present it in a way that is parallel to the classical EM algorithm for fitting GMM

Summary

Introduction

Cluster analysis (or clustering) is a cornerstone of multivariate statistics and unsupervised learning. Our method is built on the envelope principle, which is more flexible and general: it assumes that there exists a subspace such that observations projected onto this subspace share a common structure that is invariant as the underlying cluster varies. The envelope, on the other hand, is a more targeted dimension-reduction subspace whose goal is to improve efficiency in Gaussian mixture model parameter estimation and to obtain better clustering results. Under this shared covariance assumption, the envelope-EM algorithm can be even faster than standard EM estimation for Gaussian mixture models, even though the latter does not require subspace estimation.
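For reference, the classical EM algorithm for Gaussian mixture models that the envelope-EM algorithm parallels can be sketched as below. This is a minimal illustrative implementation, not the authors' envelope-EM: CLEMM's M-step additionally estimates the envelope subspace, which is omitted here, and the function name and initialization scheme are our own assumptions.

```python
import numpy as np

def em_gmm(X, K, n_iter=50):
    """Classical EM for a K-component Gaussian mixture with full covariances.
    Illustrative sketch only; the envelope-EM algorithm for CLEMM adds an
    envelope-subspace estimation step to the M-step (not shown here)."""
    n, p = X.shape
    pi = np.full(K, 1.0 / K)
    # Deterministic init: K points spread along the first coordinate
    idx = np.argsort(X[:, 0])
    mu = X[idx[np.linspace(0, n - 1, K).astype(int)]].copy()
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(p) for _ in range(K)])
    for _ in range(n_iter):
        # E-step: log responsibilities log_r[i, k] up to an additive constant
        log_r = np.zeros((n, K))
        for k in range(K):
            diff = X - mu[k]
            inv = np.linalg.inv(Sigma[k])
            _, logdet = np.linalg.slogdet(Sigma[k])
            quad = np.einsum('ij,jk,ik->i', diff, inv, diff)
            log_r[:, k] = np.log(pi[k]) - 0.5 * (logdet + quad + p * np.log(2 * np.pi))
        # Normalize in log space for numerical stability
        log_r -= log_r.max(axis=1, keepdims=True)
        gamma = np.exp(log_r)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: update weights, means, and covariances
        Nk = gamma.sum(axis=0)
        pi = Nk / n
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(p)
    return pi, mu, Sigma, gamma
```

Responsibilities are computed in log space before exponentiating, which avoids underflow when clusters are well separated; the small ridge added to each covariance guards against singular updates.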

Notation and definitions
CLEMM: Clustering with Envelope Mixture Models
Envelope in clustering: a latent variable interpretation
Working mechanism of CLEMM
Connections with factor analyzers approaches
A brief review of the EM algorithm for GMM
Envelope-EM algorithm for CLEMM
M-step
Envelope subspace estimation
CLEMM-Shared: a special case of CLEMM
Model selection
Simulations
Real data analysis
Discussion
EM implementation details
Computation time comparison