Abstract

BackgroundIt is challenging to deal with mixture models when missing values occur in clustering datasets.Methods and ResultsWe propose a dynamic clustering algorithm based on a multivariate Gaussian mixture model that efficiently imputes missing values to generate a “pseudo-complete” dataset. Parameters from different clusters and missing values are estimated according to the maximum likelihood implemented with an expectation-maximization algorithm, and multivariate individuals are clustered with Bayesian posterior probability. A simulation showed that our proposed method has a fast convergence speed and it accurately estimates missing values. Our proposed algorithm was further validated with Fisher’s Iris dataset, the Yeast Cell-cycle Gene-expression dataset, and the CIFAR-10 images dataset. The results indicate that our algorithm offers highly accurate clustering, comparable to that using a complete dataset without missing values. Furthermore, our algorithm resulted in a lower misjudgment rate than both clustering algorithms with missing data deleted and with missing-value imputation by mean replacement.ConclusionWe demonstrate that our missing-value imputation clustering algorithm is feasible and superior to both of these other clustering algorithms in certain situations.

Highlights

  • Clustering analysis, as a multivariate statistical method, refers to the process of classifying a set of observations into subsets, called clusters, such that observations in the same cluster are similar in certain respects [1,2,3]

  • We propose a dynamic clustering algorithm based on a multivariate Gaussian mixture model that efficiently imputes missing values to generate a “pseudo-complete” dataset

  • Parameters from different clusters and missing values are estimated according to the maximum likelihood implemented with an expectation-maximization algorithm, and multivariate individuals are clustered with Bayesian posterior probability

Read more

Summary

Introduction

Clustering analysis, as a multivariate statistical method, refers to the process of classifying a set of observations into subsets, called clusters, such that observations in the same cluster are similar in certain respects [1,2,3]. Clustering is widely used in medical sciences, for instance when clustering diseases or gene-expression profiles. Clustering methods usually fall into two PLOS ONE | DOI:10.1371/journal.pone.0161112. Performance Evaluation of Missing-Value Imputation Clustering design, data collection and analysis, decision to publish, or preparation of the manuscript. It is challenging to deal with mixture models when missing values occur in clustering datasets

Methods
Results
Discussion
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call