Reliability of Supervised Machine Learning Using Synthetic Data in Health Care: Model to Preserve Privacy for Data Sharing.

Debbie Rankin, Jonathan Wallace, Gorka Epelde, Raymond Bond, Michaela Black, Maurice Mulvenna

doi:10.2196/18910

Abstract

Background: The exploitation of synthetic data in health care is at an early stage. Synthetic data could unlock the potential within health care datasets that are too sensitive for release. Several synthetic data generators have been developed to date; however, studies evaluating their efficacy and generalizability are scarce.

Objective: This work sets out to understand the difference in performance of supervised machine learning models trained on synthetic data compared with those trained on real data.

Methods: A total of 19 open health datasets were selected for experimental work. Synthetic data were generated using three synthetic data generators that apply classification and regression trees, parametric, and Bayesian network approaches. Real and synthetic data were used (separately) to train five supervised machine learning models: stochastic gradient descent, decision tree, k-nearest neighbors, random forest, and support vector machine. Models were tested only on real data to determine whether a model developed by training on synthetic data can be used to accurately classify new, real examples. The impact of statistical disclosure control on model performance was also assessed.

Results: A total of 92% of models trained on synthetic data have lower accuracy than those trained on real data. Tree-based models trained on synthetic data have deviations in accuracy from models trained on real data of 0.177 (18%) to 0.193 (19%), while other models have lower deviations of 0.058 (6%) to 0.072 (7%). The winning classifier when trained and tested on real data versus models trained on synthetic data and tested on real data is the same in 26% (5/19) of cases for classification and regression tree and parametric synthetic data and in 21% (4/19) of cases for Bayesian network-generated synthetic data. Tree-based models perform best with real data and are the winning classifier in 95% (18/19) of cases. This is not the case for models trained on synthetic data. When tree-based models are not considered, the winning classifier for real and synthetic data is matched in 74% (14/19), 53% (10/19), and 68% (13/19) of cases for classification and regression tree, parametric, and Bayesian network synthetic data, respectively. Statistical disclosure control methods did not have a notable impact on data utility.

Conclusions: The results of this study are promising, with small decreases in accuracy observed in models trained with synthetic data compared with models trained with real data, where both are tested on real data. Such deviations are expected and manageable. Tree-based classifiers have some sensitivity to synthetic data, and the underlying cause requires further investigation. This study highlights the potential of synthetic data and the need for further evaluation of their robustness. Synthetic data must ensure individual privacy and data utility are preserved in order to instill confidence in health care departments when using such data to inform policy decision-making.
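The comparison reported in the Results (accuracy deviation between models trained on real and on synthetic data, both tested on real data, and how often the winning classifier is the same) can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' code: the accuracy values are invented, and the names `winner`, `accuracy_deviation`, and `winner_match_rate` are hypothetical helpers introduced here.

```python
from typing import Dict, Iterable, List, Tuple

# Maps a classifier name to its accuracy on held-out REAL test data.
Accuracies = Dict[str, float]


def winner(acc: Accuracies, exclude: Iterable[str] = ()) -> str:
    """Classifier with the highest accuracy, ignoring names in `exclude`."""
    skip = set(exclude)
    candidates = {name: a for name, a in acc.items() if name not in skip}
    return max(candidates, key=candidates.get)


def accuracy_deviation(real: Accuracies, synth: Accuracies) -> Accuracies:
    """Per-classifier accuracy drop: trained-on-real minus trained-on-synthetic."""
    return {name: real[name] - synth[name] for name in real}


def winner_match_rate(
    pairs: List[Tuple[Accuracies, Accuracies]],
    exclude: Iterable[str] = (),
) -> float:
    """Fraction of datasets whose winning classifier is the same for both
    the real-trained and the synthetic-trained models."""
    skip = tuple(exclude)
    matches = sum(winner(r, skip) == winner(s, skip) for r, s in pairs)
    return matches / len(pairs)


# Hypothetical accuracies for two datasets (NOT values from the study).
# Each pair is (trained on real, trained on synthetic); both tested on real data.
datasets = [
    ({"DT": 0.90, "KNN": 0.80, "SVM": 0.82}, {"DT": 0.72, "KNN": 0.75, "SVM": 0.77}),
    ({"DT": 0.85, "KNN": 0.86, "SVM": 0.81}, {"DT": 0.70, "KNN": 0.80, "SVM": 0.78}),
]

print(winner_match_rate(datasets))                   # 0.5
print(winner_match_rate(datasets, exclude=("DT",)))  # 1.0
print(round(100 * 5 / 19))                           # 26, as in "26% (5/19)"
```

In this toy example the decision tree wins on real data for the first dataset but loses the most accuracy on synthetic data, so excluding it raises the match rate. That mirrors the pattern the abstract describes, although the numbers here carry no evidential weight.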

Journal: JMIR Medical Informatics	Publication Date: Jul 20, 2020
Citations: 81	License type: cc-by

Similar Papers

Synthetic data in health care: A narrative review.
Aldren Gonzales ... Guruprabha Guruswamy
PLOS Digital Health | VOL. 2
06 Jan 2023

Secure and efficient anonymization of distributed confidential databases
Javier Herranz ... Jordi Nin
International Journal of Information Security | VOL. 13
23 Apr 2014

Differential Correct Attribution Probability for Synthetic Data: An Exploration
Jennifer Taub ... Duncan Smith
01 Jan 2018

Vine copula statistical disclosure control for mixed-type data
Amanda M.Y. Chu ... Mike K.P. So
Computational Statistics & Data Analysis | VOL. 176
04 Jul 2022
