Missing value imputation in high-dimensional phenomic data: imputable or not, and how?

Serena G Liao, Frank C Sciurba, George C Tseng, Dongwan D Kang, Yan Lin, Jessica Bon, Naftali Kaminski, Divay Chandra

doi:10.1186/s12859-014-0346-6

Open Access

https://doi.org/10.1186/s12859-014-0346-6

Journal: BMC Bioinformatics
Publication Date: Nov 5, 2014
Citations: 138
License type: CC BY 4.0

Affiliation: University of Pittsburgh, Yale University

Abstract

Background: In modern biomedical research of complex diseases, a large number of demographic and clinical variables, herein called phenomic data, are often collected, and missing values (MVs) are inevitable in the data collection process. Since many downstream statistical and bioinformatics methods require a complete data matrix, imputation is a common and practical solution. In high-throughput experiments such as microarray experiments, continuous intensities are measured, and many mature missing value imputation methods have been developed and widely applied. Large phenomic data, however, contain continuous, nominal, binary and ordinal data types, which rules out most of these methods. Though several methods have been developed in the past few years, no complete guideline has been proposed for phenomic missing data imputation.

Results: In this paper, we investigated existing imputation methods for phenomic data, proposed a self-training selection (STS) scheme to select the best imputation method, and provided a practical guideline for general applications. We introduced a novel concept of “imputability measure” (IM) to identify missing values that are fundamentally unsuitable for imputation. In addition, we developed four variations of K-nearest-neighbor (KNN) methods and compared them with two existing methods, multivariate imputation by chained equations (MICE) and missForest. The four variations are imputation by variables (KNN-V), by subjects (KNN-S), their weighted hybrid (KNN-H) and an adaptively weighted hybrid (KNN-A). We performed simulations and applied the different imputation methods and the STS scheme to three lung disease phenomic datasets to evaluate the methods. An R package “phenomeImpute” is made publicly available.

Conclusions: Simulations and applications to real datasets showed that MICE often did not perform well; KNN-A, KNN-H and the random forest method (missForest) were among the top performers, although no method universally performed the best. Imputation of missing values with low imputability measures increased imputation errors greatly and could potentially deteriorate downstream analyses. The STS scheme was accurate in selecting the optimal method by evaluating methods in a second layer of missingness simulation. All source files for the simulation and the real data analyses are available on the author’s publication website.

Electronic supplementary material: The online version of this article (doi:10.1186/s12859-014-0346-6) contains supplementary material, which is available to authorized users.
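The idea behind the STS scheme, evaluating candidate methods in a second layer of simulated missingness, can be illustrated with a minimal Python sketch. This is not the phenomeImpute implementation: it assumes continuous variables only, uses RMSE as the error measure, and the function names (`mean_impute`, `knn_subject_impute`, `self_training_selection`) and parameters (`frac`, `n_rep`) are hypothetical. The KNN candidate is a plain nearest-subject average, not the paper's KNN-S.

```python
import numpy as np

def mean_impute(X):
    """Replace each NaN with the mean of its column (variable)."""
    X = X.copy()
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

def knn_subject_impute(X, k=5):
    """Replace each NaN with the variable's mean over the k nearest subjects."""
    fallback = mean_impute(X)  # used when no donor subject is available
    out = X.copy()
    for i in np.where(np.isnan(X).any(axis=1))[0]:
        obs = ~np.isnan(X[i])
        # Distance over variables observed in both subjects (illustrative choice).
        diff = X[:, obs] - X[i, obs]
        n_shared = (~np.isnan(diff)).sum(axis=1)
        dist = np.sqrt(np.nansum(diff ** 2, axis=1) / np.maximum(n_shared, 1))
        dist[n_shared == 0] = np.inf
        dist[i] = np.inf  # a subject is not its own neighbor
        order = np.argsort(dist)
        for j in np.where(~obs)[0]:
            donors = [r for r in order
                      if np.isfinite(dist[r]) and not np.isnan(X[r, j])][:k]
            out[i, j] = X[donors, j].mean() if donors else fallback[i, j]
    return out

def self_training_selection(X, methods, frac=0.1, n_rep=5, seed=0):
    """Hide a fraction of the observed entries, impute, and rank methods by RMSE."""
    rng = np.random.default_rng(seed)
    obs_rows, obs_cols = np.where(~np.isnan(X))
    n_hide = max(1, int(frac * obs_rows.size))
    errors = {name: [] for name in methods}
    for _ in range(n_rep):
        pick = rng.choice(obs_rows.size, size=n_hide, replace=False)
        r, c = obs_rows[pick], obs_cols[pick]
        X2 = X.copy()
        X2[r, c] = np.nan  # second layer of missingness
        for name, impute in methods.items():
            est = impute(X2)
            errors[name].append(np.sqrt(np.mean((est[r, c] - X[r, c]) ** 2)))
    mean_err = {name: float(np.mean(v)) for name, v in errors.items()}
    best = min(mean_err, key=mean_err.get)
    return best, mean_err
```

Calling `self_training_selection(X, {"mean": mean_impute, "knn_subject": knn_subject_impute})` on a matrix with `NaN` entries returns the name of the method with the lowest average error on the artificially hidden entries, together with the per-method errors. The selected method would then be applied to the original missing values.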

Highlights

In modern biomedical research of complex diseases, a large number of demographic and clinical variables, called phenomic data, are often collected and missing values (MVs) are inevitable in the data collection process
Simulation results: We compared the performance of seven methods, namely mean imputation (MeanImp), KNN-V, KNN-S, KNN-H, KNN-A, missForest and multivariate imputation by chained equations (MICE), on three simulation scenarios
When implementing MICE, the R packages returned errors when the nominal or ordinal variables contained a large number of levels and any level contained a small number of observations

Summary

Introduction

In many studies of complex diseases, a large number of demographic, environmental and clinical variables, called phenomic data, are collected, and missing values (MVs) are inevitable in the data collection process. The presence of missing values in clinical research reduces the statistical power of the study and impedes the implementation of many statistical and bioinformatic methods that require a complete dataset (e.g. principal component analysis, clustering analysis, machine learning and graphical models). Many have pointed out that “missing value has the potential to undermine the validity of epidemiologic and clinical research and lead the conclusion to bias” [8]. Though several methods have been developed in the past few years, no complete guideline has been proposed for phenomic missing data imputation.
