Validating a membership disclosure metric for synthetic health data.

Khaled El Emam, Xi Fang, Lucy Mosquera

doi:10.1093/jamiaopen/ooac083

Open Access

https://doi.org/10.1093/jamiaopen/ooac083

Abstract

Background: One of the increasingly accepted methods to evaluate the privacy of synthetic data is by measuring the risk of membership disclosure. This is a measure of the F1 accuracy that an adversary would correctly ascertain that a target individual from the same population as the real data is in the dataset used to train the generative model, and is commonly estimated using a data partitioning methodology with a 0.5 partitioning parameter.

Objective: Validate the membership disclosure F1 score, evaluate and improve the parametrization of the partitioning method, and provide a benchmark for its interpretation.

Materials and methods: We performed a simulated membership disclosure attack on 4 population datasets: an Ontario COVID-19 dataset, a state hospital discharge dataset, a national health survey, and an international COVID-19 behavioral survey. Two generative methods were evaluated: sequential synthesis and a generative adversarial network. A theoretical analysis and a simulation were used to determine the correct partitioning parameter that would give the same F1 score as a ground truth simulated membership disclosure attack.

Results: The default 0.5 parameter can give quite inaccurate membership disclosure values. The proportion of records from the training dataset in the attack dataset must be equal to the sampling fraction of the real dataset from the population. The approach is demonstrated on 7 clinical trial datasets.

Conclusions: Our proposed parameterization, as well as interpretation and generative model training guidance provide a theoretically and empirically grounded basis for evaluating and managing membership disclosure risk for synthetic data.
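
The partitioning idea in the abstract can be illustrated with a minimal Python sketch. It builds an attack dataset in which a chosen fraction of records comes from the training data, then scores an adversary's membership claims with F1. All names here (`build_attack_set`, `membership_f1`, `member_fraction`) are hypothetical, the integer "records" are toy placeholders, and the always-claim-member adversary is an illustrative stand-in rather than the attack used in the paper.

```python
import random


def f1_score(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def build_attack_set(training, holdout, n_attack, member_fraction, rng):
    """Mix training records (members) and holdout records (non-members).

    member_fraction is the partitioning parameter: the proportion of the
    attack dataset drawn from the training data (hypothetical name).
    """
    n_members = round(n_attack * member_fraction)
    members = rng.sample(training, n_members)
    non_members = rng.sample(holdout, n_attack - n_members)
    return [(r, True) for r in members] + [(r, False) for r in non_members]


def membership_f1(attack_set, claims_member):
    """Score an adversary's membership claims against the true labels."""
    tp = fp = fn = 0
    for record, is_member in attack_set:
        guess = claims_member(record)
        if guess and is_member:
            tp += 1
        elif guess and not is_member:
            fp += 1
        elif not guess and is_member:
            fn += 1
    return f1_score(tp, fp, fn)


rng = random.Random(0)
training = list(range(0, 1000))     # toy stand-in for training records
holdout = list(range(1000, 2000))   # toy stand-in for non-member records


def always_member(record):
    """Illustrative adversary that claims every target is a member."""
    return True


# Same adversary, two partitioning parameters: the default 0.5 and an
# assumed sampling fraction of 0.1 (an illustrative value, not from the paper).
f1_default = membership_f1(
    build_attack_set(training, holdout, 200, 0.5, rng), always_member)
f1_matched = membership_f1(
    build_attack_set(training, holdout, 200, 0.1, rng), always_member)
print(f1_default, f1_matched)
```

For this trivial adversary, recall is 1 and precision equals the member fraction t, so F1 reduces to 2t / (1 + t). The same adversary therefore scores very differently under the two parameters, which shows why the choice of partitioning parameter matters when interpreting the metric.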

Journal: JAMIA Open
Publication Date: Oct 4, 2022
Citations: 12
License type: CC BY 4.0

Similar Papers

An evaluation of the replicability of analyses using synthetic health data
Khaled El Emam ... Alaa El-Hussuna
Scientific Reports | VOL. 14
24 Mar 2024

GLSTM: A novel approach for prediction of real & synthetic PID diabetes data using GANs and LSTM classification model
Priyanka Gupta ... Sushma Jaiswal
International Journal of Experimental Research and Review | VOL. 30
30 Apr 2023

Evaluating the Utility and Privacy of Synthetic Breast Cancer Clinical Trial Data Sets.
Samer El Kababji ... Ana-Alicia Beltran-Bless
JCO clinical cancer informatics | VOL. 7
01 Sep 2023

Can synthetic data accurately mimic oncology clinical trials?
Samer El Kababji ... Xi Fang
Journal of Clinical Oncology | VOL. 41
01 Jun 2023
