Optimal clustering with missing values

Shahin Boluki,Edward R Dougherty,Xiaoning Qian,Siamak Zamani Dadaneh

doi:10.1186/s12859-019-2832-3

Abstract

BackgroundMissing values frequently arise in modern biomedical studies due to various reasons, including missing tests or complex profiling technologies for different omics measurements. Missing values can complicate the application of clustering algorithms, whose goals are to group points based on some similarity criterion. A common practice for dealing with missing values in the context of clustering is to first impute the missing values, and then apply the clustering algorithm on the completed data.ResultsWe consider missing values in the context of optimal clustering, which finds an optimal clustering operator with reference to an underlying random labeled point process (RLPP). We show how the missing-value problem fits neatly into the overall framework of optimal clustering by incorporating the missing value mechanism into the random labeled point process and then marginalizing out the missing-value process. In particular, we demonstrate the proposed framework for the Gaussian model with arbitrary covariance structures. Comprehensive experimental studies on both synthetic and real-world RNA-seq data show the superior performance of the proposed optimal clustering with missing values when compared to various clustering approaches.ConclusionOptimal clustering with missing values obviates the need for imputation-based pre-processing of the data, while at the same time possessing smaller clustering errors.

Highlights

Missing values frequently arise in modern biomedical studies due to various reasons, including missing tests or complex profiling technologies for different omics measurements
The performance of the proposed method for optimal clustering with missing values at random is compared with some suboptimal versions, two other methods for clustering data with missing values, and classical clustering algorithms with imputed missing values
The performance comparison is carried out on synthetic data generated from different Gaussian random labeled point process (RLPP) models with different missing probability setups, and on a publicly available dataset of breast cancer generated by The Cancer Genome Atlas (TCGA) Research Network

Summary

Introduction

Missing values frequently arise in modern biomedical studies due to various reasons, including missing tests or complex profiling technologies for different omics measurements. Model-based clustering, which assumes that the data are generated by a finite mixture of underlying probability distributions, has gained popularity over heuristic clustering algorithms, for which there is no concrete way of determining the number of clusters or the best clustering method [3]. Model-based clustering methods [4] provide more robust criteria for selecting the appropriate number of clusters. In a Bayesian framework, utilizing Bayes Factor can incorporate both a priori knowledge of different models, and goodness of fit of the parametric model to the observed data. Nonparametric models such as Dirichlet-process mixture models [5] provide a more flexible approach for clustering, by automatically learning the number of components. In small-sample settings, model-based approaches that incorporate model uncertainty have proved successful in designing robust operators [6,7,8,9], and in objectivebased experiment design to expedite the discovery of such operators [10,11,12]

Methods

Results

Conclusion