Abstract

We propose a parsimonious extension of the classical latent class model to cluster categorical data by relaxing the conditional independence assumption. Under this new mixture model, named conditional modes model (CMM), variables are grouped into conditionally independent blocks. Each block follows a parsimonious multinomial distribution where the few free parameters model the probabilities of the most likely levels, while the remaining probability mass is uniformly spread over the other levels of the block. Thus, when the conditional independence assumption holds, this model defines parsimonious versions of the standard latent class model. Moreover, when this assumption is violated, the proposed model brings out the main intra-class dependencies between variables, thus summarizing each class with relatively few characteristic levels. Model selection is carried out by a hybrid MCMC algorithm that does not require preliminary parameter estimation. Maximum likelihood estimation is then performed, via an EM algorithm, only for the selected model. The model properties are illustrated on simulated data and on three real data sets using the associated R package CoModes. The results show that this model reduces the biases induced by the conditional independence assumption while providing meaningful parameters.
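The block distribution described above can be made concrete with a small sketch. A block of categorical variables defines a finite set of level combinations; the few most likely combinations (the "modes") each receive a free parameter, and the leftover mass is spread uniformly over the remaining combinations. This is a minimal illustration of that idea, not the CoModes implementation; all names here are assumptions.

```python
def block_prob(pattern, modes, mode_probs, n_combinations):
    """Probability of a level combination under the parsimonious
    block distribution: each mode (a most likely level combination)
    has its own free parameter, and the remaining probability mass
    is spread uniformly over the non-mode combinations.

    Illustrative sketch only; not the CoModes API.
    """
    if pattern in modes:
        return mode_probs[modes.index(pattern)]
    # leftover mass, shared uniformly by the non-mode combinations
    return (1.0 - sum(mode_probs)) / (n_combinations - len(modes))
```

For example, a block of two binary variables has four level combinations; with modes `(0, 1)` and `(1, 1)` holding probabilities 0.5 and 0.3, each of the two remaining combinations gets (1 - 0.8) / 2 = 0.1, so only two free parameters describe the block instead of three.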

Highlights

  • Clustering (Jajuga et al, 2002) is a worthwhile method for analysing complex data sets

  • This paper focuses on the cluster analysis of categorical variables which occurs in many different fields

  • The computing time of CIM is not reported since it is very low compared to that of the Conditional Modes Model (CMM): CIM is implemented in C++ in Rmixmod, while CMM is implemented only in R

Summary

Introduction

Clustering (Jajuga et al, 2002) is a worthwhile method for analysing complex data sets. This paper presents the conditional modes model (referred to as CMM), which groups the variables into conditionally independent blocks in order to capture the main conditional dependencies. Note that this idea was introduced in the Multimix software by Jorgensen and Hunt (1996) to cluster continuous and categorical data. The resulting model is a strong competitor to the conditional independence model (CIM), since it preserves sparsity and avoids many biases by accounting for the main conditional correlations. This model can be interpreted as a parsimonious version of the log-linear mixture model. Note that even though CMM can model some intra-class dependencies, it can also assume conditional independence between variables; in this way, it generalizes many of the parsimonious versions of CIM (Celeux and Govaert, 1991).
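Under CMM, each mixture component factorizes over the blocks rather than over the individual variables, so the density of an observation is a weighted sum, over classes, of products of per-block densities. This is a minimal hedged sketch of that factorization; the function and argument names are illustrative, not the CoModes interface.

```python
def mixture_density(x_blocks, weights, block_pdfs):
    """Density of one observation under a mixture whose components
    factorize over conditionally independent blocks of variables.

    x_blocks   : per-block level patterns of the observation
    weights    : class proportions (summing to one)
    block_pdfs : block_pdfs[k][b] maps a block-b pattern to its
                 probability in class k

    Illustrative sketch only; not the CoModes implementation.
    """
    total = 0.0
    for k, pi_k in enumerate(weights):
        component = pi_k
        # conditional independence holds between blocks, not
        # necessarily between the variables inside a block
        for b, pattern in enumerate(x_blocks):
            component *= block_pdfs[k][b](pattern)
        total += component
    return total
```

When every block contains a single variable, this reduces to the standard latent class (conditional independence) model, which is how CMM embeds the classical case.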

Mixture model of conditionally independent blocks
Block of multinomial distribution per modes
Model characteristics
Model selection by integrated likelihood
Model selection via an ideal Gibbs sampler
Gibbs algorithm for partition sampling
MCMC algorithm for model sampling
Maximum likelihood estimate
Challenge of the mode number selection
Results
Estimation accuracy with well specified model
Estimation accuracy with misspecified model
Applications
Alzheimer clustering
Seabirds clustering
Acute inflammations clustering
Conclusion
A Generic identifiability of CMM
New parametrization of the block distribution
Relation between embedded models
Integrated complete-data likelihood
C Details on the model sampling