Abstract
Background: Missing values frequently arise in modern biomedical studies due to various reasons, including missing tests or complex profiling technologies for different omics measurements. Missing values can complicate the application of clustering algorithms, whose goals are to group points based on some similarity criterion. A common practice for dealing with missing values in the context of clustering is to first impute the missing values, and then apply the clustering algorithm on the completed data.
Results: We consider missing values in the context of optimal clustering, which finds an optimal clustering operator with reference to an underlying random labeled point process (RLPP). We show how the missing-value problem fits neatly into the overall framework of optimal clustering by incorporating the missing value mechanism into the random labeled point process and then marginalizing out the missing-value process. In particular, we demonstrate the proposed framework for the Gaussian model with arbitrary covariance structures. Comprehensive experimental studies on both synthetic and real-world RNA-seq data show the superior performance of the proposed optimal clustering with missing values when compared to various clustering approaches.
Conclusion: Optimal clustering with missing values obviates the need for imputation-based pre-processing of the data, while at the same time possessing smaller clustering errors.
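To make the marginalization idea concrete for the Gaussian case, the following is a minimal sketch, not the authors' implementation. It assumes missing entries are encoded as NaN and that missingness does not depend on the unobserved values, so the likelihood of a point reduces to the Gaussian marginal over its observed coordinates. The function name `observed_gaussian_logpdf` is hypothetical, and the full method additionally marginalizes over the RLPP parameters and searches over partitions, which this sketch does not attempt.

```python
import numpy as np

def observed_gaussian_logpdf(x, mean, cov):
    """Log-density of the observed (non-NaN) entries of x under N(mean, cov).

    Integrating a multivariate Gaussian over its missing coordinates leaves
    a Gaussian on the observed coordinates, with the matching sub-vector of
    the mean and sub-block of the covariance. No imputation is needed.
    """
    x = np.asarray(x, dtype=float)
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)

    obs = ~np.isnan(x)            # boolean mask of observed coordinates
    k = int(obs.sum())
    if k == 0:
        return 0.0                # nothing observed: the marginal integrates to 1

    diff = x[obs] - mean[obs]
    sub_cov = cov[np.ix_(obs, obs)]          # covariance block of observed coordinates
    _, logdet = np.linalg.slogdet(sub_cov)   # stable log-determinant
    maha = diff @ np.linalg.solve(sub_cov, diff)
    return -0.5 * (k * np.log(2.0 * np.pi) + logdet + maha)
```

For a point whose second coordinate is missing, `observed_gaussian_logpdf([0.0, np.nan], [0.0, 0.0], [[1.0, 0.5], [0.5, 2.0]])` evaluates the univariate density of the first coordinate only; the off-diagonal covariance plays no role because the second coordinate has been integrated out.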
Highlights
Missing values frequently arise in modern biomedical studies due to various reasons, including missing tests or complex profiling technologies for different omics measurements
The performance of the proposed method for optimal clustering with missing values at random is compared with some suboptimal versions, two other methods for clustering data with missing values, and classical clustering algorithms with imputed missing values
The performance comparison is carried out on synthetic data generated from different Gaussian random labeled point process (RLPP) models with different missing probability setups, and on a publicly available dataset of breast cancer generated by The Cancer Genome Atlas (TCGA) Research Network
Summary
Missing values frequently arise in modern biomedical studies due to various reasons, including missing tests or complex profiling technologies for different omics measurements. Model-based clustering, which assumes that the data are generated by a finite mixture of underlying probability distributions, has gained popularity over heuristic clustering algorithms, for which there is no concrete way of determining the number of clusters or the best clustering method [3]. Model-based clustering methods [4] provide more robust criteria for selecting the appropriate number of clusters. In a Bayesian framework, the Bayes factor can incorporate both a priori knowledge of different models and the goodness of fit of the parametric model to the observed data. Nonparametric models such as Dirichlet-process mixture models [5] provide a more flexible approach for clustering by automatically learning the number of components. In small-sample settings, model-based approaches that incorporate model uncertainty have proved successful in designing robust operators [6,7,8,9], and in objective-based experiment design to expedite the discovery of such operators [10,11,12].