Self-Training With Quantile Errors for Multivariate Missing Data Imputation for Regression Problems in Electronic Medical Records: Algorithm Development Study.

Hansle Gwon, Ha Na Cho, Young-Hak Kim, Hee Jun Kang, Hyeram Seo, Heejung Choi, Imjin Ahn, Tae Joon Jun, Yunha Kim

doi:10.2196/30824

Open Access

https://doi.org/10.2196/30824

Abstract

Background: When using machine learning in the real world, the missing value problem is the first problem encountered. Methods for imputing missing values include statistical methods such as mean imputation, expectation-maximization, and multiple imputation by chained equations (MICE), as well as machine learning methods such as multilayer perceptron, k-nearest neighbor, and decision tree.

Objective: The objective of this study was to impute numeric medical data such as physical data and laboratory data. We aimed to impute data effectively using a progressive method called self-training in the medical field, where training data are scarce.

Methods: In this paper, we propose a self-training method that gradually increases the available data. Models trained with complete data predict the missing values in incomplete data. Among the incomplete data, the records whose missing values are validly predicted are incorporated into the complete data. Using the predicted value as the actual value is called pseudolabeling. This process is repeated until a stopping condition is satisfied. The most important part of this process is how to evaluate the accuracy of the pseudolabels. They can be evaluated by observing the effect of the pseudolabeled data on the performance of the model.

Results: In self-training using random forest (RF), mean squared error was up to 12% lower than with pure RF, and the Pearson correlation coefficient was 0.1% higher. This difference was confirmed statistically. In the Friedman test performed on MICE and RF, self-training showed a P value between .003 and .02. A Wilcoxon signed-rank test performed on the mean imputation showed the lowest possible P value, 3.05e-5, in all situations.

Conclusions: Self-training showed significant results in comparing the predicted values and actual values, but it needs to be verified in an actual machine learning system. Self-training also has the potential to improve performance depending on the pseudolabel evaluation method, which will be the main subject of our future research.
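The loop described under Methods can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: a small k-nearest-neighbor regressor stands in for the random forest, only one target column is imputed, and a batch of pseudolabels is accepted when it does not increase the error on a held-out validation set. That acceptance rule is only one way to "observe the effect of the pseudolabeled data on the performance of the model"; the quantile-error criterion named in the title is not described in this text. All function names, parameters, and data values are hypothetical.

```python
from statistics import mean

def knn_predict(train, x, k=3):
    """Mean target of the k training rows nearest to feature vector x."""
    nearest = sorted(
        train, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x))
    )[:k]
    return mean(y for _, y in nearest)

def val_mse(train, val, k=3):
    """Mean squared error on held-out complete rows."""
    return mean((knn_predict(train, x, k) - y) ** 2 for x, y in val)

def self_train(teacher, students, val, batch=2, k=3):
    """teacher: [(features, target)] complete rows.
    students: [features] rows whose target is missing.
    val: complete rows used only to judge pseudolabels (an assumption)."""
    train = list(teacher)
    pending = list(students)
    best = val_mse(train, val, k)
    progressed = True
    while pending and progressed:          # repeat until nothing more is accepted
        progressed = False
        still_pending = []
        for i in range(0, len(pending), batch):
            chunk = pending[i:i + batch]
            # Pseudolabeling: treat the predicted value as the actual value.
            pseudo = [(x, knn_predict(train, x, k)) for x in chunk]
            err = val_mse(train + pseudo, val, k)
            if err <= best:                # pseudolabels did not hurt the model
                train += pseudo
                best = err
                progressed = True
            else:                          # keep the rows for a later pass
                still_pending += chunk
        pending = still_pending
    return train, pending, best

# Made-up toy data: one feature, target roughly 2 * feature.
teacher = [((float(i),), 2.0 * i) for i in range(0, 10, 2)]
students = [(1.0,), (3.0,), (5.0,), (7.0,)]
val = [((2.5,), 5.0), ((6.5,), 13.0)]
train, pending, best = self_train(teacher, students, val)
```

A nonparametric model is used here on purpose: with an ordinary least-squares line, pseudolabels lie exactly on the fitted line and adding them would leave the fit unchanged, so the loop would have nothing to show.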

Highlights

When trying to use data in machine learning or statistical analysis, the missing value problem is one of the most common challenges
A teacher data set with 10,000 data points represents complete data without missing values, and the 50,000 student data points contain missing values
We named the complete data set the teacher and the data set with missing values the student
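The teacher/student naming can be made concrete with a short sketch. It assumes, for illustration only, that each record is a tuple of numeric measurements and that a missing measurement is stored as `None`; the function name and the sample values are hypothetical and do not come from the study data.

```python
def split_teacher_student(rows):
    """Rows with no missing entry form the teacher; all other rows form the student."""
    teacher = [r for r in rows if all(v is not None for v in r)]
    student = [r for r in rows if any(v is None for v in r)]
    return teacher, student

# Made-up records with three numeric measurements each.
rows = [
    (36.6, 120.0, 5.1),
    (37.2, None, 4.8),
    (None, 110.0, None),
    (36.9, 118.0, 5.0),
]
teacher, student = split_teacher_student(rows)
```

If missing values are stored as NaN instead of `None`, the check would need `math.isnan` rather than an identity comparison.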

Summary

Introduction

Background: When trying to use data in machine learning or statistical analysis, the missing value problem is one of the most common challenges. A missing value can be caused by a malfunction of the inspection machine, an incorrect inspection, or human error. It can also arise when data are converted for analysis purposes.

Results: In self-training using random forest (RF), mean squared error was up to 12% lower than with pure RF, and the Pearson correlation coefficient was 0.1% higher.

Conclusions: Self-training showed significant results in comparing the predicted values and actual values, but it needs to be verified in an actual machine learning system. Self-training has the potential to improve performance depending on the pseudolabel evaluation method, which will be the main subject of our future research.
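The two metrics quoted in the results, mean squared error and the Pearson correlation coefficient, compare imputed values with the true values that were hidden. The sketch below shows how they are typically computed, along with the relative reduction behind a statement such as "12% lower". The function names and the example numbers are hypothetical, chosen only to illustrate the arithmetic.

```python
from math import sqrt
from statistics import mean

def mse(actual, predicted):
    """Mean squared error between true and imputed values."""
    return mean((a - p) ** 2 for a, p in zip(actual, predicted))

def pearson_r(actual, predicted):
    """Pearson correlation coefficient between true and imputed values."""
    ma, mp = mean(actual), mean(predicted)
    cov = sum((a - ma) * (p - mp) for a, p in zip(actual, predicted))
    var_a = sum((a - ma) ** 2 for a in actual)
    var_p = sum((p - mp) ** 2 for p in predicted)
    return cov / sqrt(var_a * var_p)

def relative_mse_reduction(baseline_mse, new_mse):
    """Fraction by which new_mse is lower than baseline_mse."""
    return (baseline_mse - new_mse) / baseline_mse

# Hypothetical numbers: a baseline MSE of 1.00 and a new MSE of 0.88
# correspond to a 12% reduction.
reduction = relative_mse_reduction(1.00, 0.88)
```

Lower MSE and higher Pearson correlation both indicate imputed values closer to the actual ones.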

Journal: JMIR Public Health and Surveillance
Publication Date: Oct 13, 2021
Citations: 4
License type: cc-by

Similar Papers

Evaluation of machine learning methods for covariate data imputation in pharmacometrics.
Dominic Stefan Bräm ... Marc Pfister
CPT: Pharmacometrics & Systems Pharmacology | VOL. 11
08 Nov 2022

What is missing from my missing data plan?
Sharon D Yeatts ... Renée H Martin
Stroke | VOL. 46
07 May 2015

Simulation study on missing data imputation methods for longitudinal data in cohort studies
Y M Li ... F Y Chen
Zhonghua liu xing bing xue za zhi = Zhonghua liuxingbingxue zazhi | VOL. 42
10 Oct 2021

A Comparison of Strategies for Missing Values in Data on Machine Learning Classification Algorithms
Tebogo Makaba ... Eustace Dogo
01 Nov 2019
