Accuracy of random-forest-based imputation of missing data in the presence of non-normality, non-linearity, and interaction

Shangzhi Hong, Henry S Lynn

doi:10.1186/s12874-020-01080-1

Journal: BMC Medical Research Methodology
Publication Date: Jul 25, 2020
Citations: 121
License type: open-access

Affiliation: Fudan University

Abstract

Background: Missing data are common in statistical analyses, and imputation methods based on random forests (RF) are becoming popular for handling missing data especially in biomedical research. Unlike standard imputation approaches, RF-based imputation methods do not assume normality or require specification of parametric models. However, it is still inconclusive how they perform for non-normally distributed data or when there are non-linear relationships or interactions.

Methods: To examine the effects of these three factors, a variety of datasets were simulated with outcome-dependent missing at random (MAR) covariates, and the performances of the RF-based imputation methods missForest and CALIBERrfimpute were evaluated in comparison with predictive mean matching (PMM).

Results: Both missForest and CALIBERrfimpute have high predictive accuracy but missForest can produce severely biased regression coefficient estimates and downward biased confidence interval coverages, especially for highly skewed variables in nonlinear models. CALIBERrfimpute typically outperforms missForest when estimating regression coefficients, although its biases are still substantial and can be worse than PMM for logistic regression relationships with interaction.

Conclusions: RF-based imputation, in particular missForest, should not be indiscriminately recommended as a panacea for imputing missing data, especially when data are highly skewed and/or outcome-dependent MAR. A correct analysis requires a careful critique of the missing data mechanism and the inter-relationships between the variables in the data.
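To make the phrase "outcome-dependent MAR covariates" in the abstract concrete, here is a minimal Python sketch of one such mechanism. It is an illustration only, not the authors' simulation design: the linear model, the coefficients `beta0` and `beta1`, the logistic missingness rule, and the function names are all assumptions chosen for clarity.

```python
import math
import random
from statistics import mean


def simulate_outcome_dependent_mar(n=5000, beta0=1.0, beta1=2.0, seed=0):
    """Return (x_observed, y) pairs; x_observed is None where X is missing.

    Hypothetical setup: X ~ N(0, 1), Y = beta0 + beta1 * X + N(0, 1) noise.
    The probability that X is missing depends only on the fully observed
    outcome Y, which is what makes the mechanism outcome-dependent MAR.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = beta0 + beta1 * x + rng.gauss(0.0, 1.0)
        # Logistic missingness model in Y (assumed form, centred at beta0).
        p_missing = 1.0 / (1.0 + math.exp(-(y - beta0)))
        x_obs = None if rng.random() < p_missing else x
        rows.append((x_obs, y))
    return rows


def missing_rate(rows):
    """Fraction of rows in which the covariate X is missing."""
    return mean(1.0 if x is None else 0.0 for x, _ in rows)


rows = simulate_outcome_dependent_mar()
high_y = [r for r in rows if r[1] > 1.0]   # outcomes above beta0
low_y = [r for r in rows if r[1] <= 1.0]   # outcomes at or below beta0
complete_case_mean_x = mean(x for x, _ in rows if x is not None)

print("missing rate, high Y:", round(missing_rate(high_y), 3))
print("missing rate, low Y: ", round(missing_rate(low_y), 3))
print("complete-case mean of X:", round(complete_case_mean_x, 3))
```

In this toy setting, X is missing more often when Y is large, so the complete-case mean of X falls below its true value of 0. Any imputation method evaluated under such a mechanism has to recover the relationship between X and Y, not just predict X well on average.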

Highlights

Missing data are common in statistical analyses, and imputation methods based on random forests (RF) are becoming popular for handling missing data, especially in biomedical research.
In a comparison study by Waljee et al. [6], missForest was found to consistently produce the lowest imputation error compared with other imputation methods, including k-nearest neighbors (k-NN) imputation and “mice” [7], when data were missing completely at random (MCAR).
Bias of variable estimates: When estimating the mean of X across the eight distributions (Fig. 2), missForest on average gave relative biases of 2.0, 1.3, 1.7, and 1.4%, compared with 1.4, 2.5, 2.3, and 1.7% for CALIBERrfimpute and 3.2, 1.4, 2.7, and 5.3% for predictive mean matching (PMM), for scenarios 1 through 4, respectively. (To be concise, we report in the text the mean of the absolute values of the mean relative bias for each distribution when summarizing the relative bias across the eight distributions.) MissForest had the smallest bias except in scenario 1.
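The summary statistic described in the parenthetical above can be sketched in a few lines of Python. The percent relative bias formula used here, 100 × (mean estimate − true value) / true value, is the conventional definition and is an assumption, since this page does not state the formula. The function names are hypothetical. The `reported` table copies the per-scenario figures quoted in the text.

```python
from statistics import mean


def relative_bias_pct(estimates, true_value):
    """Percent relative bias of the average estimate (assumed definition)."""
    return 100.0 * (mean(estimates) - true_value) / true_value


def summarize_abs_relative_bias(per_distribution_bias_pct):
    """Mean of the absolute values of per-distribution mean relative biases."""
    return mean(abs(b) for b in per_distribution_bias_pct)


# Summarized relative biases (%) for scenarios 1 to 4, as quoted in the text.
reported = {
    "missForest": [2.0, 1.3, 1.7, 1.4],
    "CALIBERrfimpute": [1.4, 2.5, 2.3, 1.7],
    "PMM": [3.2, 1.4, 2.7, 5.3],
}


def smallest_bias_by_scenario(table):
    """For each scenario index, name the method with the smallest bias."""
    n_scenarios = len(next(iter(table.values())))
    return [min(table, key=lambda m: table[m][i]) for i in range(n_scenarios)]


best = smallest_bias_by_scenario(reported)
for scenario, method in enumerate(best, start=1):
    print(f"scenario {scenario}: smallest bias from {method}")
```

Taking absolute values before averaging keeps positive and negative biases in different distributions from cancelling each other out, which is presumably why the summary is defined this way.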

Summary

Introduction

Missing data are common in statistical analyses, particularly in clinical and public health studies, and imputation methods based on machine learning algorithms, especially those based on random forests (RF), are gaining acceptance [1]. RF-based imputation methods do not assume normality or require specification of parametric models. However, it remains inconclusive how they perform for non-normally distributed data or when there are non-linear relationships or interactions. How CALIBERrfimpute and missForest imputation differ in their effects on statistical analyses warrants further investigation.
