Accuracy of random-forest-based imputation of missing data in the presence of non-normality, non-linearity, and interaction

Abstract

Background: Missing data are common in statistical analyses, and imputation methods based on random forests (RF) are becoming popular for handling missing data, especially in biomedical research. Unlike standard imputation approaches, RF-based imputation methods do not assume normality or require specification of parametric models. However, it is still inconclusive how they perform for non-normally distributed data or when there are non-linear relationships or interactions.

Methods: To examine the effects of these three factors, a variety of datasets were simulated with outcome-dependent missing at random (MAR) covariates, and the performances of the RF-based imputation methods missForest and CALIBERrfimpute were evaluated in comparison with predictive mean matching (PMM).

Results: Both missForest and CALIBERrfimpute have high predictive accuracy, but missForest can produce severely biased regression coefficient estimates and downward-biased confidence interval coverages, especially for highly skewed variables in non-linear models. CALIBERrfimpute typically outperforms missForest when estimating regression coefficients, although its biases are still substantial and can be worse than those of PMM for logistic regression relationships with interaction.

Conclusions: RF-based imputation, in particular missForest, should not be indiscriminately recommended as a panacea for imputing missing data, especially when data are highly skewed and/or outcome-dependent MAR. A correct analysis requires a careful critique of the missing data mechanism and the inter-relationships between the variables in the data.
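
The three imputation approaches compared above are all available as R packages, so the comparison can be sketched in outline. The following is a minimal illustration under assumed inputs (a data frame dat with an incomplete continuous covariate x and an outcome y), not the authors' simulation code:

```r
## Minimal sketch (assumed data, not the study's code): impute `dat` with the
## three methods compared in the paper, then fit and pool a substantive model.
library(missForest)       # missForest()
library(mice)             # mice(), pool(), method "pmm"
library(CALIBERrfimpute)  # adds mice.impute.rfcont() / mice.impute.rfcat()

## missForest: single imputation by iterative random forests
dat_mf <- missForest(dat)$ximp

## CALIBERrfimpute: random-forest conditional draws inside mice
## ("rfcont" for continuous variables; "rfcat" would be used for factors)
imp_rf <- mice(dat, method = "rfcont", m = 5, printFlag = FALSE)

## Predictive mean matching (PMM), the comparator method
imp_pmm <- mice(dat, method = "pmm", m = 5, printFlag = FALSE)

## Pool estimates across the multiply imputed datasets (Rubin's rules)
summary(pool(with(imp_rf,  lm(y ~ x))))
summary(pool(with(imp_pmm, lm(y ~ x))))
```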

Highlights

  • Missing data are common in statistical analyses, and imputation methods based on random forests (RF) are becoming popular for handling missing data especially in biomedical research

  • In a comparison study by Waljee et al. [6], missForest was found to consistently produce the lowest imputation error compared with other imputation methods, including k-nearest neighbors (k-NN) imputation and “mice” [7], when data were missing completely at random (MCAR)

  • Bias of variable estimates: when estimating the mean of X across the eight distributions (Fig. 2), missForest on average gave relative biases of 2.0%, 1.3%, 1.7% and 1.4% for scenarios 1 through 4, respectively, compared with 1.4%, 2.5%, 2.3% and 1.7% for CALIBERrfimpute and 3.2%, 1.4%, 2.7% and 5.3% for predictive mean matching (PMM). (To be concise, the text reports the mean of the absolute values of the mean relative bias for each distribution when summarizing the relative bias across the eight distributions; a minimal sketch of this summary follows the highlights.) MissForest had the smallest bias except in scenario 1

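For the bias summary quoted in the last highlight, relative bias is the difference between an estimate and the true value expressed as a fraction of the truth, and the single figure reported per method is the mean of its absolute values across the eight distributions. A minimal sketch with hypothetical numbers (not the study's results):

```r
## Hypothetical illustration of the bias summary used in the highlights:
## percentage relative bias of the estimated mean of X in eight distributions,
## then the mean absolute relative bias as the single reported summary figure.
true_means <- c(0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0)  # assumed true values
est_means  <- true_means * (1 + rnorm(8, sd = 0.02))      # assumed estimates

rel_bias_pct <- 100 * (est_means - true_means) / true_means
mean(abs(rel_bias_pct))   # mean of the absolute relative biases, in percent
```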

Summary

Introduction

Missing data are common in clinical and public health studies, and imputation methods based on machine learning algorithms, especially those based on random forests (RF), are gaining acceptance [1]. Unlike standard approaches, RF-based imputation methods do not assume normality or require specification of parametric models, yet it is still inconclusive how they perform for non-normally distributed data or when there are non-linear relationships or interactions. The effects of missForest and CALIBERrfimpute imputation on subsequent statistical analyses therefore warrant further investigation.

References (showing 9 of 17 papers)
  • Open Access
  • Cited by 248
  • 10.1093/ije/dyu080
What is the difference between missing completely at random and missing at random?
  • Apr 4, 2014
  • International Journal of Epidemiology
  • Krishnan Bhaskaran + 1 more

  • Open Access
  • Cited by 554
  • 10.1002/sam.11348
Random Forest Missing Data Algorithms.
  • Jun 13, 2017
  • Statistical Analysis and Data Mining: The ASA Data Science Journal
  • Fei Tang + 1 more

  • Open Access
  • Cited by 706
  • 10.1093/bioinformatics/btg287
A Bayesian missing value estimation method for gene expression profile data.
  • Nov 1, 2003
  • Bioinformatics
  • Shigeyuki Oba + 5 more

  • Open Access
  • Cited by 8
  • 10.1080/00949655.2018.1530773
A simulation comparison of imputation methods for quantitative data in the presence of multiple data patterns
  • Oct 8, 2018
  • Journal of Statistical Computation and Simulation
  • N Solaro + 3 more

  • Open Access
  • Cited by 4389
  • 10.1093/bioinformatics/btr597
MissForest—non-parametric missing value imputation for mixed-type data
  • Oct 28, 2011
  • Bioinformatics
  • Daniel J Stekhoven + 1 more

  • Open Access
  • Cited by 8824
  • 10.18637/jss.v045.i03
Mice: Multivariate Imputation by Chained Equations in R
  • Jan 1, 2011
  • Journal of Statistical Software
  • Stef Van Buuren + 1 more

  • Open Access
  • Cited by 584
  • 10.1093/aje/kwt312
Comparison of Random Forest and Parametric Imputation Models for Imputing Missing Data Using MICE: A CALIBER Study
  • Jan 12, 2014
  • American Journal of Epidemiology
  • Anoop D Shah + 4 more

  • Open Access
  • Cited by 400
  • 10.1136/bmjopen-2013-002847
Comparison of imputation methods for missing laboratory data in medicine
  • Aug 1, 2013
  • BMJ Open
  • Akbar K Waljee + 8 more

  • Open Access
  • Cited by 35
  • 10.1038/s41467-018-04633-7
Dynamically prognosticating patients with hepatocellular carcinoma through survival paths mapping based on time-series data
  • Jun 8, 2018
  • Nature Communications
  • Lujun Shen + 12 more

Citations (showing 10 of 190 papers)
  • Research Article
  • Cited by 6
  • 10.1186/s12874-024-02392-2
A novel MissForest-based missing values imputation approach with recursive feature elimination in medical applications
  • Nov 8, 2024
  • BMC Medical Research Methodology
  • Ya-Han Hu + 3 more

Background: Missing values in datasets present significant challenges for data analysis, particularly in the medical field where data accuracy is crucial for patient diagnosis and treatment. Although MissForest (MF) has demonstrated efficacy in imputation research and recursive feature elimination (RFE) has proven effective in feature selection, the potential for enhancing MF through RFE integration remains unexplored.

Methods: This study introduces a novel imputation method, “recursive feature elimination-MissForest” (RFE-MF), designed to enhance imputation quality by reducing the impact of irrelevant features. A comparative analysis is conducted between RFE-MF and four classical imputation methods: mean/mode, k-nearest neighbors (kNN), multiple imputation by chained equations (MICE), and MF. The comparison is carried out across ten medical datasets containing both numerical and mixed data types. Different missing data rates, ranging from 10% to 50%, are evaluated under the missing completely at random (MCAR) mechanism. The performance of each method is assessed using two evaluation metrics: normalized root mean squared error (NRMSE) and predictive fidelity criterion (PFC). Additionally, paired samples t-tests are employed to analyze the statistical significance of differences among the outcomes.

Results: The findings indicate that RFE-MF demonstrates superior performance across the majority of datasets when compared to the four classical imputation methods (mean/mode, kNN, MICE, and MF). Notably, RFE-MF consistently outperforms the original MF, irrespective of variable type (numerical or categorical). Mean/mode imputation exhibits consistent performance across various scenarios. Conversely, the efficacy of kNN imputation fluctuates in relation to varying missing data rates.

Conclusion: This study demonstrates that RFE-MF holds promise as an effective imputation method for medical datasets, providing a novel approach to addressing missing data challenges in medical applications.
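
Both evaluation metrics named in that abstract are straightforward to compute once the originally observed values are known. A minimal hand-rolled sketch (variable names are illustrative; NRMSE is normalised by the variance of the true values, following the missForest convention, and PFC is computed here as the proportion of falsely classified categorical entries):

```r
## Sketch of the two imputation-error metrics, evaluated only on the entries
## that were set to missing. Inputs are vectors at the missing positions:
## x_imp / x_true are numeric; c_imp / c_true are categorical.
nrmse <- function(x_imp, x_true) {
  sqrt(mean((x_imp - x_true)^2) / var(x_true))
}
pfc <- function(c_imp, c_true) {
  mean(c_imp != c_true)
}
```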

  • Research Article
  • 10.3390/electronics14204108
A Novel Gating Adversarial Imputation Method for High-Fidelity Restoration of Missing Electrical Disturbance Data
  • Oct 20, 2025
  • Electronics
  • Lidan Chen + 2 more

The ongoing evolution of cyber-physical power systems renders them susceptible to frequent and multifaceted electrical disturbances. Critically, missingness resulting from cascading cyber-physical failures severely impedes the ability to accurately monitor and diagnose these electrical disturbances. To address this serious challenge, this paper proposes a novel gating adversarial imputation (GAI) framework specially tailored for the high-fidelity restoration of missing electrical disturbance data. The proposed GAI efficiently introduces the latest gating mechanism into a stability-improved adversarial imputation process, enabling robust feature representation while maintaining high imputation accuracy. To validate its efficacy, a synthetic dataset encompassing 15 distinct disturbance types is constructed based on precise mathematical equations and standard missingness. A comprehensive experimental evaluation demonstrates that the proposed GAI consistently outperforms five representative imputation benchmarks across all tested missing percentages. Moreover, GAI effectively preserves the original critical characteristics during data recovery, thereby enhancing accurate system monitoring and operational security.

  • Research Article
  • 10.1038/s41598-025-95490-0
Machine learning with hyperparameter optimization applied in facies-supported permeability modeling in carbonate oil reservoirs
  • Apr 15, 2025
  • Scientific Reports
  • Watheq J Al-Mudhafar + 3 more

Most carbonate reservoirs exhibit heterogeneous pore distribution, whereby the matrix displays low permeability, thus impeding the flow of oil. On the other hand, highly permeable fractures function as the main flow conduits within such reservoirs. Permeability measurements are obtained from core and well test analysis, which are too expensive and not available for many wells. Therefore, accurate permeability prediction is a vital step in developing an efficient field development plan, as it plays a pivotal role in the accurate distribution of 3D petrophysical properties throughout a reservoir. Machine learning (ML) algorithms are now widely applied to predict core permeability using conventional well logs to build a model for permeability prediction in uncored wells. This review considers the performance of six ML algorithms (LightGBM, CATBoost, XGBoost, Adaboost, random forest and gradient boosting) for permeability prediction from a high-quality dataset. The dataset incorporates multiple well-log inputs (gamma ray, caliper, density, neutron porosity, shallow and deep resistivity, total porosity, spontaneous potential, water saturation, depth, and facies) in addition to direct core permeability and porosity measurements. Data pre-processing techniques applied include missing data imputation, scale correction, normalization with three different transformations (log, Box-Cox, and NST) and outlier detection. To enhance the ML performance, two search algorithms (random search and Bayesian optimization) are compared in their ability to tune the ML hyperparameters. There is a need to identify a suitable parameter space, especially when the target variable range is changing. ML performance was evaluated with four evaluation metrics (RMSE, MAE, R2, and Adjusted R2). Results showed that the XGBoost algorithm with configuration of (RS as search algorithm, Box Cox as the normalization method, Z-score for outlier detection, without scale correction, old parameter space) delivered the best prediction performance for permeability with RMSE values of 6.9 md and 9.78 md for training and testing, respectively.

  • Research Article
  • Cited by 61
  • 10.1016/j.jhydrol.2021.126454
Automatic gap-filling of daily streamflow time series in data-scarce regions using a machine learning algorithm
  • May 14, 2021
  • Journal of Hydrology
  • Pedro Arriagada + 2 more


  • Research Article
  • 10.1016/j.cjca.2025.09.017
Machine Learning Reveals How Depression Influences Chest Pain Localisation and Its Predictive Value for Coronary Artery Disease.
  • Sep 1, 2025
  • The Canadian journal of cardiology
  • Mohsyn Imran Malik + 4 more


  • Open Access
  • Research Article
  • Cited by 32
  • 10.7554/elife.70640
Early prediction of in-hospital death of COVID-19 patients: a machine-learning model based on age, blood analyses, and chest x-ray score.
  • Oct 18, 2021
  • eLife
  • Marika Vezzoli + 6 more

An early-warning model to predict in-hospital mortality on admission of COVID-19 patients at an emergency department (ED) was developed and validated using a machine-learning model. In total, 2782 patients were enrolled between March 2020 and December 2020, including 2106 patients (first wave) and 676 patients (second wave) in the COVID-19 outbreak in Italy. The first-wave patients were divided into two groups with 1474 patients used to train the model, and 632 to validate it. The 676 patients in the second wave were used to test the model. Age, 17 blood analytes, and Brescia chest X-ray score were the variables processed using a random forests classification algorithm to build and validate the model. Receiver operating characteristic (ROC) analysis was used to assess the model performances. A web-based death-risk calculator was implemented and integrated within the Laboratory Information System of the hospital. The final score was constructed by age (the most powerful predictor), blood analytes (the strongest predictors were lactate dehydrogenase, D-dimer, neutrophil/lymphocyte ratio, C-reactive protein, lymphocyte %, ferritin std, and monocyte %), and Brescia chest X-ray score (https://bdbiomed.shinyapps.io/covid19score/). The areas under the ROC curve obtained for the three groups (training, validating, and testing) were 0.98, 0.83, and 0.78, respectively. The model predicts in-hospital mortality on the basis of data that can be obtained in a short time, directly at the ED on admission. It functions as a web-based calculator, providing a risk score which is easy to interpret. It can be used in the triage process to support the decision on patient allocation.

  • Research Article
  • Cited by 1
  • 10.1016/j.trd.2024.104332
Using Multi-Source data to identify high NOx emitting Heavy-Duty diesel vehicles
  • Jul 25, 2024
  • Transportation Research Part D
  • Zhuoqian Yang + 3 more


  • Research Article
  • 10.1186/s40537-024-01009-1
The application of adaptive group LASSO imputation method with missing values in personal income compositional data
  • Nov 19, 2024
  • Journal of Big Data
  • Ying Tian + 2 more

From social and economic perspectives, compositional data represent the proportions of various components within a whole, carrying non-negative values and providing only relative information. However, in many circumstances, there are often a significant number of missing values in datasets. Due to the complexity caused by these missing values, traditional estimation methods are ineffective. In this paper, an adaptive group LASSO-based imputation method is proposed for compositional data, consolidating the advantages of group LASSO and adaptive LASSO analysis techniques. Considering the impact of outliers on the accuracy of estimation, both simulation and case analysis are conducted to compare the proposed algorithm against four existing methods. The experimental results demonstrate that the proposed adaptive group LASSO method produces a better imputation performance at comparable missing rates.

  • Preprint Article
  • 10.21203/rs.3.rs-6974078/v1
The Longitudinal Association Between Health and Labour Market Participation: A Study of English Millennials
  • Jul 2, 2025
  • Alison Fang-Wei Wu + 3 more

Background: Young adults in England face increasing health and labour market challenges, yet little is known about how health throughout the life course is associated with adulthood employment. This study examines the longitudinal association between prior health problems and labour market participation at age 32, focusing on the timing and accumulated health disadvantages among English Millennials.

Methods: Using the Next Steps study data, we focused on three health indicators: long-term illness, self-rated general health, and mental health, measured in childhood, adolescence, and early adulthood. Multinomial logistic regressions evaluated timing and accumulated health disadvantage, adjusting for gender, ethnicity and parental education and occupation during adolescence.

Results: Poor health across all three life stages was consistently associated with increased risk of economic inactivity. In contrast, associations with unemployment were more selective, with health problems in early adulthood, but not in adolescence, remaining significant after accounting for earlier health issues. Accumulated exposure to all three health issues across life stages was also significantly associated with an increasing risk of unemployment and inactivity at age 32. Gender differences were observed: the association between poor health and later economic inactivity was generally stronger among men than women. However, for timing and accumulated mental health disadvantages, women showed a stronger link with unemployment.

Conclusions: These findings emphasise the importance of adopting a life course perspective to understand the relationship between health and employment. Early support for health across dimensions during childhood and early adulthood could be essential for addressing later labour market inequalities and disengagement.

  • Open Access
  • Research Article
  • Cited by 25
  • 10.1186/s12963-021-00274-z
Addressing missing values in routine health information system data: an evaluation of imputation methods using data from the Democratic Republic of the Congo during the COVID-19 pandemic
  • Nov 4, 2021
  • Population Health Metrics
  • Shuo Feng + 2 more

Background: Poor data quality is limiting the use of data sourced from routine health information systems (RHIS), especially in low- and middle-income countries. An important component of this data quality issue comes from missing values, where health facilities, for a variety of reasons, fail to report to the central system.

Methods: Using data from the health management information system in the Democratic Republic of the Congo and the advent of the COVID-19 pandemic as an illustrative case study, we implemented seven commonly used imputation methods and evaluated their performance in terms of minimizing bias in imputed values and parameter estimates generated through subsequent analytical techniques, namely segmented regression, which is widely used in interrupted time series studies, and pre-post comparisons through paired Wilcoxon rank-sum tests. We also examined the performance of these imputation methods under different missing mechanisms and tested their stability to changes in the data.

Results: For regression analyses, there were no substantial differences found in the coefficient estimates generated from all methods except mean imputation and exclusion and interpolation when the data contained less than 20% missing values. However, as the missing proportion grew, k-NN started to produce biased estimates. Machine learning algorithms, i.e. missForest and k-NN, were also found to lack robustness to small changes in the data or consecutive missingness. On the other hand, multiple imputation methods generated the overall most unbiased estimates and were the most robust to all changes in data. They also produced smaller standard errors than single imputations. For pre-post comparisons, all methods produced p values less than 0.01, regardless of the amount of missingness introduced, suggesting low sensitivity of Wilcoxon rank-sum tests to the imputation method used.

Conclusions: We recommend the use of multiple imputation in addressing missing values in RHIS datasets and appropriate handling of data structure to minimize imputation standard errors. In cases where the necessary computing resources are unavailable for multiple imputation, one may consider seasonal decomposition as the next best method. Mean imputation and exclusion and interpolation, however, always produced biased and misleading results in the subsequent analyses, and thus their use in the handling of missing values should be discouraged.

Similar Papers
  • Research Article
  • Cited by 2
  • 10.3390/informatics10040077
A Machine Learning-Based Multiple Imputation Method for the Health and Aging Brain Study–Health Disparities
  • Oct 11, 2023
  • Informatics
  • Fan Zhang + 5 more

The Health and Aging Brain Study–Health Disparities (HABS–HD) project seeks to understand the biological, social, and environmental factors that impact brain aging among diverse communities. A common issue for HABS–HD is missing data. It is impossible to achieve accurate machine learning (ML) if data contain missing values. Therefore, developing a new imputation methodology has become an urgent task for HABS–HD. The three missing data assumptions, (1) missing completely at random (MCAR), (2) missing at random (MAR), and (3) missing not at random (MNAR), necessitate distinct imputation approaches for each mechanism of missingness. Several popular imputation methods, including listwise deletion, min, mean, predictive mean matching (PMM), classification and regression trees (CART), and missForest, may result in biased outcomes and reduced statistical power when applied to downstream analyses such as testing hypotheses related to clinical variables or utilizing machine learning to predict AD or MCI. Moreover, these commonly used imputation techniques can produce unreliable estimates of missing values if they do not account for the missingness mechanisms or if there is an inconsistency between the imputation method and the missing data mechanism in HABS–HD. Therefore, we proposed a three-step workflow to handle missing data in HABS–HD: (1) missing data evaluation, (2) imputation, and (3) imputation evaluation. First, we explored the missingness in HABS–HD. Then, we developed a machine learning-based multiple imputation method (MLMI) for imputing missing values. We built four ML-based imputation models (support vector machine (SVM), random forest (RF), extreme gradient boosting (XGB), and lasso and elastic-net regularized generalized linear model (GLMNET)) and adapted the four ML-based models to multiple imputations using the simple averaging method. Lastly, we evaluated and compared MLMI with other common methods. Our results showed that the three-step workflow worked well for handling missing values in HABS–HD and the ML-based multiple imputation method outperformed other common methods in terms of prediction performance and change in distribution and correlation. The choice of missing handling methodology has a significant impact on the accompanying statistical analyses of HABS–HD. The conceptual three-step workflow and the ML-based multiple imputation method perform well for our Alzheimer’s disease models. They can also be applied to other disease data analyses.

  • Research Article
  • Cited by 212
  • 10.1186/1471-2288-10-7
Comparison of techniques for handling missing covariate data within prognostic modelling studies: a simulation study
  • Jan 19, 2010
  • BMC Medical Research Methodology
  • Andrea Marshall + 3 more

Background: There is no consensus on the most appropriate approach to handle missing covariate data within prognostic modelling studies. Therefore a simulation study was performed to assess the effects of different missing data techniques on the performance of a prognostic model.

Methods: Datasets were generated to resemble the skewed distributions seen in a motivating breast cancer example. Multivariate missing data were imposed on four covariates using four different mechanisms: missing completely at random (MCAR), missing at random (MAR), missing not at random (MNAR) and a combination of all three mechanisms. Five amounts of incomplete cases from 5% to 75% were considered. Complete case analysis (CC), single imputation (SI) and five multiple imputation (MI) techniques available within the R statistical software were investigated: a) a data augmentation (DA) approach assuming a multivariate normal distribution, b) DA assuming a general location model, c) regression switching imputation, d) regression switching with predictive mean matching (MICE-PMM) and e) flexible additive imputation models. A Cox proportional hazards model was fitted and appropriate estimates for the regression coefficients and model performance measures were obtained.

Results: Performing a CC analysis produced unbiased regression estimates but inflated standard errors, which affected the significance of the covariates in the model with 25% or more missingness. Using SI underestimated the variability, resulting in poor coverage even with 10% missingness. Of the MI approaches, applying MICE-PMM produced, in general, the least biased estimates and better coverage for the incomplete covariates and better model performance for all mechanisms. However, this MI approach still produced biased regression coefficient estimates for the incomplete skewed continuous covariates when 50% or more cases had missing data imposed with a MCAR, MAR or combined mechanism. When the missingness depended on the incomplete covariates, i.e. MNAR, estimates were biased with more than 10% incomplete cases for all MI approaches.

Conclusion: The results from this simulation study suggest that performing MICE-PMM may be the preferred MI approach provided that less than 50% of the cases have missing data and the missing data are not MNAR.
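
Imposing multivariate MCAR, MAR and MNAR missingness of the kind described in that simulation can be sketched with the ampute() function from the R package mice; the complete data frame and the 25% proportion below are illustrative assumptions, not the study's design:

```r
## Sketch: amputing an assumed complete data frame under three mechanisms.
library(mice)

amp_mcar <- ampute(complete_dat, prop = 0.25, mech = "MCAR")$amp
amp_mar  <- ampute(complete_dat, prop = 0.25, mech = "MAR")$amp
amp_mnar <- ampute(complete_dat, prop = 0.25, mech = "MNAR")$amp

## Each $amp is the original data frame with values deleted so that roughly
## the requested proportion of cases is incomplete under the chosen mechanism.
```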

  • Research Article
  • Cited by 12
  • 10.1002/ajpa.24614
Missing data in bioarchaeology II: A test of ordinal and continuous data imputation.
  • Sep 12, 2022
  • American journal of biological anthropology
  • Amanda Wissler + 2 more

Previous research has shown that while missing data are common in bioarchaeological studies, they are seldom handled using statistically rigorous methods. The primary objective of this article is to evaluate the ability of imputation to manage missing data and encourage the use of advanced statistical methods in bioarchaeology and paleopathology. An overview of missing data management in biological anthropology is provided, followed by a test of imputation and deletion methods for handling missing data. Missing data were simulated on complete datasets of ordinal (n=287) and continuous (n=369) bioarchaeological data. Missing values were imputed using five imputation methods (mean, predictive mean matching, random forest, expectation maximization, and stochastic regression) and the success of each at recovering the parameters of the original dataset was compared with pairwise and listwise deletion. In all instances, listwise deletion was least successful at approximating the original parameters. Imputation of continuous data was more effective than imputation of ordinal data. Overall, no one method performed best, and the amount of missing data proved a stronger predictor of imputation success. These findings support the use of imputation methods over deletion for handling missing bioarchaeological and paleopathology data, especially when the data are continuous. Whereas deletion methods reduce sample size, imputation maintains sample size, improving statistical power and preventing bias from being introduced into the dataset.

  • Research Article
  • Cited by 2
  • 10.1186/s13040-022-00316-8
Classification of breast cancer recurrence based on imputed data: a simulation study
  • Dec 7, 2022
  • BioData Mining
  • Rahibu A Abassi + 1 more

Several studies have been conducted to classify various real-life events, but few are in medical fields, particularly breast cancer recurrence analysed with statistical techniques. To our knowledge, there is no reported comparison of statistical classification accuracy and classifiers’ discriminative ability for breast cancer recurrence in the presence of imputed missing data. Therefore, this article aims to fill this analysis gap by comparing the performance of binary classifiers (logistic regression, linear and quadratic discriminant analysis) using several datasets resulting from the imputation process under various simulation conditions. Our study adds to knowledge about how classifiers’ accuracy and discriminative ability in classifying a binary outcome variable are affected by the presence of imputed numerical missing data. We simulated incomplete datasets with 15, 30, 45 and 60% missingness under Missing At Random (MAR) and Missing Completely At Random (MCAR) mechanisms. Mean imputation, hot deck, k-nearest neighbour, multiple imputation via chained equations, expectation-maximisation, and predictive mean matching were used to impute the incomplete datasets. For each classifier, correct classification accuracy and area under the Receiver Operating Characteristic (ROC) curve under the MAR and MCAR mechanisms were compared. The linear discriminant classifier attained the highest classification accuracy (73.9%) based on mean-imputed data at 45% missing data under the MCAR mechanism. As a classifier, logistic regression based on predictive-mean-matching-imputed data yields the greatest area under the ROC curve (0.6418) at 30% missingness, while k-nearest neighbour tops the value (0.6428) at 60% missing data under the MCAR mechanism.

  • Research Article
  • Cited by 4
  • 10.1016/j.eswa.2023.122775
Efficient imputation of missing data using the information of local space defined by the geometric one-class classifier
  • Nov 30, 2023
  • Expert Systems with Applications
  • Do Gyun Kim + 1 more


  • Research Article
  • 10.47974/jsms-971
Comparing the performance of eight imputation methods for propensity score matching in missing data problem
  • Jan 1, 2023
  • Journal of Statistics and Management Systems
  • Imran Kurt Omurlu + 2 more

The propensity score (PS) is a popular method to control for covariates in observational studies. A challenge in PS analyses is missing values in covariates. This study aims to investigate how different imputation methods for handling missing values of covariates in a PS analysis can affect average treatment effect on the treated (ATT) estimates. Missing data imputation methods were evaluated using different datasets, whose covariates were low, medium and highly (r=0.10, 0.50, 0.85) correlated with each other, for n=200 units, with the simulation run 1000 times. Missing data structures were created according to the missing at random (MAR) mechanism and different missing rates. Different datasets were obtained after having imputed the missing values separately by eight imputation methods: mean, median, mode, hot deck, last observation carried forward (LOCF), next observation carried backward (NOCB), regression and predictive mean matching (PMM). Then PS nearest neighbor matching was implemented and ATT scores were obtained using the imputed datasets. The predictive performance of the imputation methods was compared according to ATT scores by hierarchical cluster analysis with Euclidean distance and complete linkage. ATT scores of the regression and PMM methods were closer to each other, and these methods showed the best predictive performance. Additionally, when there were larger amounts of missing data, PMM was the best method of choice. Ignoring missing values on covariates in PS analyses causes significant information loss, and this information loss becomes greater as the rate of missing data increases. PS analyses might also be biased if missing data on covariates are ignored. To prevent this information loss and bias, PS analyses should be performed after solving the problem of MAR missing data on covariates with the regression and PMM methods, which showed statistical superiority compared to other methods in this study.

  • Research Article
  • 10.31579/2642-9756/118
Imputation methods on retrospective breast cancer data in Tanzania: A comparative study
  • Jun 6, 2022
  • Women Health Care and Issues
  • Rahibu A Abassi + 2 more

Background: Clinical datasets are at risk of having missing data for several reasons, including patients’ failure to attend clinical measurements and measurement recorders’ defects. Missing data can significantly affect the analysis, and results might be doubtful due to bias caused by the omission of incomplete records during analysis, especially if a dataset is small. This study aims to compare several imputation methods in terms of efficiency in filling in missing data so as to increase prediction and classification accuracy in a breast cancer dataset.

Methodology: Six imputation methods, namely series mean, k-nearest neighbour, hot deck, predictive mean matching, expected maximisation via bootstrapping, and multiple imputation by chained equations, were applied to replace the missing values in the real breast cancer dataset. The efficiency of the imputation methods was compared using root mean square errors and mean absolute errors to obtain a suitable complete dataset. Binary logistic regression and linear discriminant classifiers were applied to the imputed dataset to compare their efficacy on classification and discrimination.

Results: The evaluation of imputation methods revealed that the predictive mean matching method performed better than the other imputation methods. In addition, binary logistic regression and linear discriminant analyses yield almost similar values for overall classification rates, sensitivity and specificity.

Conclusion: The predictive mean matching imputation showed higher accuracy in estimating and replacing missing data values in the real breast cancer dataset under study. It is an effective approach to handling missing data. We recommend replacing missing data using predictive mean matching, since it is a plausible approach toward multiple imputation for numerical variables. It improves estimation and prediction accuracy over complete-case analysis, especially when the percentage of missing data is not very small.

  • Research Article
  • 10.31767/su.4(91)2020.04.02
Software Implementation of Missing Data Recovery: Comparative Analysis
  • Dec 16, 2020
  • Statistics of Ukraine
  • N V Kovtun + 1 more

The paper contains a comparative analysis of the possibilities of using different software products to solve the problem of missing data, using as an example a sample for which different variants of data gaps were simulated. The study provided an opportunity to identify the strengths and weaknesses of these software products, as well as to determine the effectiveness of particular methods for different amounts of missing information. The easiest way to handle missing data is Statistica, but it offers only simple methods for processing data with missing values, so this program helps only when the number of omissions is small (up to 10%). SPSS offers a wider range of data imputation methods than Statistica and, at the same time, a more user-friendly interface compared with the R or SAS programming languages. In the R and SAS software environments, different methods of missing data imputation can be used, from the simplest to the most complex, such as multiple imputation. Thus, R and SAS are the most powerful missing data recovery programs, but they are more complex for users because they require knowledge of a programming language. It was found that none of the mentioned software-analytical environments has built-in procedures for processing categorical data with missing values. There are approaches that can be implemented by analogy for ordered categories in the R and SAS software environments, but they do not cover all the needs of analysing research conducted in the form of surveys, whose results are mostly presented as answers. The methods used to impute quantitative data cannot be applied to categorical data, even if numbers are used to encode responses. The study demonstrated that handling missing data, as well as choosing how to apply particular imputation methods in different software environments, should be approached very carefully, and the imputation problem should be solved in each case based on careful analysis of the existing database, considering not only the characteristics of the data and the number of gaps but also the specifics of the particular study. Dealing with missing data involves a wide range of issues, including the nature of the gaps and the methodology for data processing and imputation, which depend not only on the nature of the data but also on their type and on the software environment used for imputation. Future research is planned to assess the effectiveness of imputation methods in different software environments, and to develop methodological principles for restoring gaps in categorical data and implement them in practice.

  • Book Chapter
  • Cited by 2
  • 10.1007/978-3-030-49536-7_8
Imputing Missing Values: Reinforcement Bayesian Regression and Random Forest
  • Jun 30, 2020
  • Shahriar Shakir Sumit + 4 more

Imputing missing data plays a pivotal role in minimizing the biases of knowledge in computational data. The principal purpose of this paper is to establish a better approach to dealing with missing data. Clinical data often contain erroneous values, which cause major drawbacks for analysis. In this paper, we present a new dynamic approach for managing missing data in biomedical databases in order to improve overall modeling accuracy. We propose a reinforcement Bayesian regression model. Furthermore, we compare Bayesian regression and the random forest dynamically under a reinforcement approach to minimize the ambiguity of knowledge. Our results indicate that the random forest imputation method scores better than Bayesian regression in several cases. At best, the reinforcement Bayesian regression scores over 85% accuracy with 5% missing data. Overall, the reinforcement Bayesian regression achieves over 70% accuracy for imputing missing medical data. However, in over 70% of cases the values imputed by the proposed reinforcement Bayesian regression model are exactly identical to the missing values, which is a notable advantage of the study. This approach significantly improves the accuracy of imputing missing data for clinical research.

  • Research Article
  • Cited by 15
  • 10.1155/2021/6668822
Evaluation of Four Multiple Imputation Methods for Handling Missing Binary Outcome Data in the Presence of an Interaction between a Dummy and a Continuous Variable
  • May 17, 2021
  • Journal of Probability and Statistics
  • Sara Javadi + 4 more

Multiple imputation by chained equations (MICE) is the most common method for imputing missing data. In the MICE algorithm, imputation can be performed using a variety of parametric and nonparametric methods. The default setting in the implementation of MICE is for imputation models to include variables as linear terms only with no interactions, but omission of interaction terms may lead to biased results. It is investigated, using simulated and real datasets, whether recursive partitioning creates appropriate variability between imputations and unbiased parameter estimates with appropriate confidence intervals. We compared four multiple imputation (MI) methods on a real and a simulated dataset. MI methods included using predictive mean matching with an interaction term in the imputation model in MICE (MICE-interaction), classification and regression tree (CART) for specifying the imputation model in MICE (MICE-CART), the implementation of random forest (RF) in MICE (MICE-RF), and MICE-Stratified method. We first selected secondary data and devised an experimental design that consisted of 40 scenarios (2 × 5 × 4), which differed by the rate of simulated missing data (10%, 20%, 30%, 40%, and 50%), the missing mechanism (MAR and MCAR), and imputation method (MICE-Interaction, MICE-CART, MICE-RF, and MICE-Stratified). First, we randomly drew 700 observations with replacement 300 times, and then the missing data were created. The evaluation was based on raw bias (RB) as well as five other measurements that were averaged over the repetitions. Next, in a simulation study, we generated data 1000 times with a sample size of 700. Then, we created missing data for each dataset once. For all scenarios, the same criteria were used as for real data to evaluate the performance of methods in the simulation study. It is concluded that, when there is an interaction effect between a dummy and a continuous predictor, substantial gains are possible by using recursive partitioning for imputation compared to parametric methods, and also, the MICE-Interaction method is always more efficient and convenient to preserve interaction effects than the other methods.
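
In mice, the approaches compared in that study correspond to different settings of the method argument, with the interaction either supplied explicitly or left for tree-based methods to discover. A minimal sketch under assumed variable names (a dummy d, a continuous x and an outcome y in an incomplete data frame dat), not the authors' code:

```r
## Sketch: handling a dummy-by-continuous interaction in mice.
library(mice)

## (a) CART imputation: tree splits can pick up the interaction from the data
imp_cart <- mice(dat, method = "cart", m = 5, printFlag = FALSE)

## (b) Random-forest imputation as built into mice
imp_rf <- mice(dat, method = "rf", m = 5, printFlag = FALSE)

## (c) PMM with the interaction carried along via passive imputation
dat$dx <- dat$d * dat$x           # product term used by the imputation models
meth <- make.method(dat)          # defaults to "pmm" for incomplete numeric columns
meth["dx"] <- "~ I(d * x)"        # recompute dx from imputed d and x each cycle
imp_pmm_int <- mice(dat, method = meth, m = 5, printFlag = FALSE)
## (in practice the predictorMatrix is also adjusted so dx does not predict d or x)
```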

  • Research Article
  • Cited by 15
  • 10.3390/w15081519
Comparing Single and Multiple Imputation Approaches for Missing Values in Univariate and Multivariate Water Level Data
  • Apr 13, 2023
  • Water
  • Nura Umar + 1 more

Missing values in water level data is a persistent problem in data modelling and especially common in developing countries. Data imputation has received considerable research attention, to raise the quality of data in the study of extreme events such as flooding and droughts. This article evaluates single and multiple imputation methods used on monthly univariate and multivariate water level data from four water stations on the rivers Benue and Niger in Nigeria. The missing completely at random, missing at random and missing not at random data mechanisms were each considered. The best imputation method is identified using two error metrics: root mean square error and mean absolute percentage error. For the univariate case, the seasonal decomposition method is best for imputing missing values at various missingness levels for all three missing mechanisms, followed by Kalman smoothing, while random imputation is much poorer. For instance, for 5% missing data for the Kainji water station, missing completely at random, the Kalman smoothing, random and seasonal decomposition methods had average root mean square errors of 13.61, 102.60 and 10.46, respectively. For the multivariate case, missForest is best, closely followed by k nearest neighbour for the missing completely at random and missing at random mechanisms, and k nearest neighbour is best, followed by missForest, for the missing not at random mechanism. The random forest and predictive mean matching methods perform poorly in terms of the two metrics considered. For example, for 10% missing data missing completely at random for the Ibi water station, the average root mean square errors for random forest, k nearest neighbour, missForest and predictive mean matching were 22.51, 17.17, 14.60 and 25.98, respectively. The results indicate that the seasonal decomposition method, and missForest or k nearest neighbour methods, can impute univariate and multivariate water level missing data, respectively, with higher accuracy than the other methods considered.
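
The univariate methods named in that comparison have counterparts in the R package imputeTS; the sketch below assumes that package's current underscore naming and an assumed monthly ts object wl containing NAs, and is illustrative only:

```r
## Sketch: univariate water-level imputation with imputeTS (assumed series wl).
library(imputeTS)

imp_seadec <- na_seadec(wl)   # seasonal decomposition, then imputation
imp_kalman <- na_kalman(wl)   # Kalman smoothing on a state-space model
imp_random <- na_random(wl)   # random values between the observed min and max
```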

  • Research Article
  • Cited by 128
  • 10.1186/1471-2288-10-112
Comparison of imputation methods for handling missing covariate data when fitting a Cox proportional hazards model: a resampling study.
  • Dec 1, 2010
  • BMC Medical Research Methodology
  • Andrea Marshall + 2 more

Background: The appropriate handling of missing covariate data in prognostic modelling studies is yet to be conclusively determined. A resampling study was performed to investigate the effects of different missing data methods on the performance of a prognostic model.

Methods: Observed data for 1000 cases were sampled with replacement from a large complete dataset of 7507 patients to obtain 500 replications. Five levels of missingness (ranging from 5% to 75%) were imposed on three covariates using a missing at random (MAR) mechanism. Five missing data methods were applied: a) complete case analysis (CC), b) single imputation using regression switching with predictive mean matching (SI), c) multiple imputation using regression switching imputation, d) multiple imputation using regression switching with predictive mean matching (MICE-PMM) and e) multiple imputation using flexible additive imputation models. A Cox proportional hazards model was fitted to each dataset and estimates for the regression coefficients and model performance measures obtained.

Results: CC produced biased regression coefficient estimates and inflated standard errors (SEs) with 25% or more missingness. The underestimated SE after SI resulted in poor coverage with 25% or more missingness. Of the MI approaches investigated, MI using MICE-PMM produced the least biased estimates and better model performance measures. However, this MI approach still produced biased regression coefficient estimates with 75% missingness.

Conclusions: Very few differences were seen between the results from all missing data approaches with 5% missingness. However, performing MI using MICE-PMM may be the preferred missing data approach for handling between 10% and 50% MAR missingness.

  • Research Article
  • Cited by 69
  • 10.1016/j.jss.2006.05.003
A new imputation method for small software project data sets
  • Jun 16, 2006
  • Journal of Systems and Software
  • Qinbao Song + 1 more


  • Research Article
  • Cited by 9
  • 10.1016/j.aaf.2021.12.013
Assessing methods for multiple imputation of systematic missing data in marine fisheries time series with a new validation algorithm
  • Feb 18, 2022
  • Aquaculture and Fisheries
  • Iván F Benavides + 4 more


  • Research Article
  • 10.3999/jscpt.45.135
3. From the Standpoint of Industry
  • Jan 1, 2014
  • Rinsho yakuri/Japanese Journal of Clinical Pharmacology and Therapeutics
  • Osamu Togo

In clinical trials involving repeated measures, missing data may occur for various reasons. Missing data not only cause loss of information in clinical trials, but are also a potential source of bias and loss of precision in the evaluation results, and may subsequently lead to erroneous conclusions. Therefore, efforts to minimize the generation of missing data are necessary when planning to conduct a study. Furthermore, missing data imputation and statistical methods for handling missing data are being examined and discussed in academic societies. On the regulatory side, the European Medicines Agency (EMA) published the “Guidelines on Missing Data in Confirmatory Clinical Trials” in 2009, which describes the treatment and interpretation of missing data. This article shows our experience of imputing missing data using the modified total Sharp score (mTSS), an index of joint damage progression, in a clinical trial on patients with rheumatoid arthritis. Some imputation methods and statistical methods are applied to dummy data that are based on the actual clinical trial, and the results are compared and discussed.

More from: BMC Medical Research Methodology
  • Research Article
  • 10.1186/s12874-025-02690-3
Assessing the quality for integrated guidelines: systematic comparison between the AGREE Ⅱ and AGREE-HS tools.
  • Nov 6, 2025
  • BMC medical research methodology
  • Gezhi Zhang + 7 more

  • Research Article
  • 10.1186/s12874-025-02706-y
A methodology for developing dermatological datasets: lessons from retrospective data collection for AI-based applications.
  • Nov 5, 2025
  • BMC medical research methodology
  • Alma Pedro + 10 more

  • Research Article
  • 10.1186/s12874-025-02689-w
Assessing the accuracy of survival machine learning and traditional statistical models for Alzheimer's disease prediction over time: a study on the ADNI cohort
  • Nov 5, 2025
  • BMC Medical Research Methodology
  • Sardar Jahani + 2 more

  • Research Article
  • 10.1186/s12874-025-02670-7
Current practice on covariate adjustment and stratified analysis —based on survey results by ASA oncology estimand working group conditional and marginal effect task force
  • Nov 4, 2025
  • BMC Medical Research Methodology
  • Jiawei Wei + 7 more

  • Supplementary Content
  • 10.1186/s12874-025-02683-2
Enhancing confidence in complex health technology assessments by using real-world evidence: highlighting existing strategies for effective drug evaluation
  • Nov 3, 2025
  • BMC Medical Research Methodology
  • Alison Antoine + 4 more

  • Discussion
  • 10.1186/s12874-025-02700-4
The importance of considering variability in re-expression of effect estimates for use in meta-analyses
  • Oct 30, 2025
  • BMC Medical Research Methodology
  • Leonid Kopylev + 1 more

  • Discussion
  • 10.1186/s12874-025-02699-8
Response to “The importance of considering variability in re-expression of effect estimates for use in meta-analysis.” (Kopylev and Dzierlenga 2025)
  • Oct 30, 2025
  • BMC Medical Research Methodology
  • Matthew W Linakis + 1 more

  • Research Article
  • 10.1186/s12874-025-02685-0
Comparing in-person and remote consent of people with dementia into a primary care-based cluster randomised controlled trial: lessons from the Dementia PersonAlised Care Team (D-PACT) feasibility study.
  • Oct 30, 2025
  • BMC medical research methodology
  • T M Oh + 19 more

  • Research Article
  • 10.1186/s12874-025-02696-x
Identifying delayed human response to external risks: an econometric analysis of mobility change during a pandemic
  • Oct 29, 2025
  • BMC Medical Research Methodology
  • Gaofei Zhang + 4 more

  • Research Article
  • 10.1186/s12874-025-02694-z
Comparison of machine learning methods versus traditional Cox regression for survival prediction in cancer using real-world data: a systematic literature review and meta-analysis
  • Oct 28, 2025
  • BMC Medical Research Methodology
  • Yinan Huang + 6 more
