Abstract

UK Biobank is a large cohort study that suffers from missing data. Its observations are complex: the data are noisy, missing rates differ across variables, and both the distributions and the data types are diverse, all of which makes imputing the missing values challenging. To address this issue, we propose an imputation framework for UK Biobank based on prior knowledge and multiple imputation by chained equations (MICE). The framework consists of three parts: (1) data cleaning, to eliminate the interference of noisy data; (2) correction of imputation eligibility for subjects with high missing rates and variables with low variance; and (3) MICE, to impute the different types of independent variables. For continuous and categorical variables, we compare linear regression, linear regression with bootstrap, Bayesian linear regression, and random forest. The best imputation model for continuous variables is linear regression, with a normalized mean absolute error (MAE) of 0.072+/-0.004 in the experiment at the actual missing percentage, while the best model for categorical variables is random forest, with a normalized MAE of 0.129+/-0.003. For binary variables, we compare logistic regression, logistic regression with bootstrap, and random forest; the best model is random forest, with an accuracy of 0.907+/-0.008 in the experiment at the actual missing percentage. In addition, data cleaning improves the imputation accuracy by 6.83% overall. The correction for variables with high missing rates is also a significant step: the imputation accuracy over all types of variables is 0.842+/-0.006, 0.826+/-0.006, 0.793+/-0.005, 0.766+/-0.007, and 0.742+/-0.006 when the missing percentage is 50%, 60%, 70%, 80%, and 90%, respectively.
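To illustrate the chained-equations idea behind part (3), the following is a minimal sketch and not the framework evaluated here. It assumes a toy table with only two continuous columns, uses plain least-squares regression as the per-variable model, and performs a single deterministic imputation; full MICE draws imputed values from a predictive distribution and produces several completed datasets. The function names `fit_line` and `chained_impute` and the toy data are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit with one predictor; returns (slope, intercept)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:  # constant predictor: fall back to predicting the mean
        return 0.0, my
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def chained_impute(rows, n_iter=20):
    """Fill None entries of a two-column table by alternating regressions.

    Assumption: deterministic predictions only, so this is a single
    imputation rather than true *multiple* imputation.
    """
    n_cols = 2
    missing = [[row[j] is None for j in range(n_cols)] for row in rows]
    # Initial fill: replace each gap with the observed mean of its column.
    col_means = [
        mean(row[j] for row in rows if row[j] is not None)
        for j in range(n_cols)
    ]
    filled = [
        [col_means[j] if row[j] is None else float(row[j])
         for j in range(n_cols)]
        for row in rows
    ]
    for _ in range(n_iter):
        for target in range(n_cols):
            predictor = 1 - target
            # Fit on rows where the target was actually observed,
            # using the current (possibly imputed) predictor values.
            observed = [i for i in range(len(rows)) if not missing[i][target]]
            slope, intercept = fit_line(
                [filled[i][predictor] for i in observed],
                [filled[i][target] for i in observed],
            )
            # Overwrite only the originally missing target entries.
            for i in range(len(rows)):
                if missing[i][target]:
                    filled[i][target] = slope * filled[i][predictor] + intercept
    return filled


# Toy data generated from y = 2x + 1, with one gap in each column.
rows = [(1, 3), (2, 5), (3, 7), (4, None), (5, 11), (None, 13)]
completed = chained_impute(rows)
```

On this noise-free toy table the two imputed entries should settle close to the values implied by the underlying line. With real, noisy data the per-variable model would be chosen by variable type, as in the model comparison reported above.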
