Abstract

Many clinical research datasets have a large percentage of missing values, which directly reduces their usefulness for training high-accuracy classifiers in supervised machine learning. While missing value imputation methods have been shown to work well with smaller percentages of missing values, their ability to impute sparse clinical research data can be problem-specific. We previously attempted to learn quantitative guidelines for ordering cardiac magnetic resonance imaging during the evaluation for pediatric cardiomyopathy, but missing data significantly reduced our usable sample size. In this work, we sought to determine whether increasing the usable sample size through imputation would allow us to learn better guidelines. We first review several machine learning methods for estimating missing data. Then, we apply four popular methods (mean imputation, decision tree, k-nearest neighbors, and self-organizing maps) to a clinical research dataset of pediatric patients undergoing evaluation for cardiomyopathy. Using Bayesian Rule Learning (BRL) to learn ruleset models, we compared the performance of imputation-augmented models with that of unaugmented models. We found that all four imputation-augmented models performed similarly to unaugmented models. While imputation did not improve performance, it did provide evidence for the robustness of our learned models.

Highlights

  • In biomedical research, samples with missing values are typically discarded to obtain a complete dataset

  • Missing data mechanisms can be categorized into three types: missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR) [6,7]

  • We reviewed 12 papers that compared the performance of different imputation methods; they are summarized in Supplementary Table S1, with information on methods and their evaluation, along with types of datasets used and performance results reported


Summary

Introduction

Samples with missing values are typically discarded to obtain a complete dataset. Since the early 2000s, a new paradigm has emerged in which missing values are treated as unknown values to be learned through a machine learning model. In this framework, data samples with observed values for a particular variable are used as a training set for a machine learning model, which is then applied to the data samples with missing values to impute them. Missing data mechanisms can be categorized into three types: missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR) [6,7]. Simple methods such as listwise deletion or mean imputation will only be unbiased when data are MCAR. In the best-case scenario, a pattern of missingness that departs from MAR can be modeled using prior knowledge in order to bring the data closer to MAR and improve the quality of imputations obtained through methods that assume MAR.
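The learning-based framework described above can be illustrated with a minimal sketch that contrasts mean imputation with k-nearest neighbors imputation. This is not the implementation used in the study: the function names (mean_impute, knn_impute), the toy rows, and the choice of Euclidean distance are all assumptions made for illustration, and the sketch assumes that every column other than the one being imputed is fully observed.

```python
import math


def mean_impute(rows, col):
    """Replace None in column `col` with the mean of the observed values."""
    observed = [r[col] for r in rows if r[col] is not None]
    mean = sum(observed) / len(observed)
    return [
        r[:col] + [mean] + r[col + 1:] if r[col] is None else list(r)
        for r in rows
    ]


def knn_impute(rows, col, k=2):
    """Impute None in column `col` from the k nearest rows where it is observed.

    Rows with an observed value in `col` act as the training set.
    Distance is Euclidean over the remaining columns, which this
    sketch assumes are fully observed (an illustrative simplification).
    """
    train = [r for r in rows if r[col] is not None]
    others = [j for j in range(len(rows[0])) if j != col]

    def dist(a, b):
        return math.sqrt(sum((a[j] - b[j]) ** 2 for j in others))

    imputed = []
    for r in rows:
        if r[col] is None:
            nearest = sorted(train, key=lambda t: dist(r, t))[:k]
            value = sum(t[col] for t in nearest) / len(nearest)
            imputed.append(r[:col] + [value] + r[col + 1:])
        else:
            imputed.append(list(r))
    return imputed


# Hypothetical toy data: column 1 is missing (None) in the last row.
rows = [
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
    [10.0, 100.0],
    [2.5, None],
]

mean_filled = mean_impute(rows, col=1)
knn_filled = knn_impute(rows, col=1, k=2)
print(mean_filled[4][1], knn_filled[4][1])
```

In this toy case, mean imputation fills the gap with the overall column mean, which is pulled upward by the outlying fourth row, whereas the neighbor-based estimate uses only the two rows most similar to the incomplete one. Real clinical data would also need feature scaling and handling of rows with several missing variables, which this sketch omits.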

