Evaluating the utility of synthetic COVID-19 case data.

Khaled El Emam, Elizabeth Jonker, Harpreet Sood, Lucy Mosquera

doi:10.1093/jamiaopen/ooab012

Abstract

Background: Concerns about patient privacy have limited access to COVID-19 datasets. Data synthesis is one approach for making such data broadly available to the research community in a privacy protective manner.

Objectives: Evaluate the utility of synthetic data by comparing analysis results between real and synthetic data.

Methods: A gradient boosted classification tree was built to predict death using Ontario’s 90 514 COVID-19 case records linked with community comorbidity, demographic, and socioeconomic characteristics. Model accuracy and relationships were evaluated, as well as privacy risks. The same model was developed on a synthesized dataset and compared to one from the original data.

Results: The AUROC and AUPRC for the real data model were 0.945 [95% confidence interval (CI), 0.941–0.948] and 0.34 (95% CI, 0.313–0.368), respectively. The synthetic data model had AUROC and AUPRC of 0.94 (95% CI, 0.936–0.944) and 0.313 (95% CI, 0.286–0.342) with confidence interval overlap of 45.05% and 52.02% when compared with the real data. The most important predictors of death for the real and synthetic models were in descending order: age, days since January 1, 2020, type of exposure, and gender. The functional relationships were similar between the two data sets. Attribute disclosure risks were 0.0585, and membership disclosure risk was low.

Conclusions: This synthetic dataset could be used as a proxy for the real dataset.
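The abstract reports confidence interval overlap without stating how it is computed. As a rough illustration only, the sketch below assumes one common definition from the synthetic data literature: the length of the intersection of the two intervals, expressed as a fraction of each interval's width and then averaged. The function name `ci_overlap` is hypothetical, and the inputs are the rounded 95% CI bounds quoted above, so the output is not expected to match the reported 45.05% and 52.02% exactly.

```python
def ci_overlap(real_ci, synth_ci):
    """Average fractional overlap of two confidence intervals.

    Assumed definition (not confirmed by the abstract): intersection
    length divided by each interval's width, averaged over the two.
    """
    (real_lo, real_hi), (synth_lo, synth_hi) = real_ci, synth_ci
    # Length of the intersection; negative when the intervals are disjoint.
    intersection = min(real_hi, synth_hi) - max(real_lo, synth_lo)
    return 0.5 * (
        intersection / (real_hi - real_lo)
        + intersection / (synth_hi - synth_lo)
    )


# Rounded 95% CI bounds taken from the abstract.
auroc_overlap = ci_overlap((0.941, 0.948), (0.936, 0.944))
auprc_overlap = ci_overlap((0.313, 0.368), (0.286, 0.342))

print(f"AUROC CI overlap: {auroc_overlap:.1%}")
print(f"AUPRC CI overlap: {auprc_overlap:.1%}")
```

With these rounded bounds the sketch gives roughly 40% for AUROC and 52% for AUPRC. The gap from the reported AUROC figure may come from rounding of the published bounds or from a different overlap definition in the paper.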


Journal: JAMIA Open
Publication Date: Mar 1, 2021
Citations: 28
License type: CC BY-NC 4.0


Similar Papers

Evaluating the Utility and Privacy of Synthetic Breast Cancer Clinical Trial Data Sets.
Samer El Kababji ... Lois Shepherd
JCO clinical cancer informatics | VOL. 7
01 Sep 2023

Assessing privacy and quality of synthetic health data
Andrew Yale ... Adrien Pavao
13 May 2019

Utility measures for evaluating synthetic data
20 Apr 2023

Synthetic Data Generation By Artificial Intelligence to Accelerate Translational Research and Precision Medicine in Hematological Malignancies
Saverio D'Amico ...
Blood | VOL. 140
15 Nov 2022
