Membership Disclosure Research Articles

Synthetic data generation is increasingly used as a privacy-preserving approach for sharing health data. In addition to protecting privacy, it is important to ensure that generated data has high utility. A common way to assess utility is the ability of synthetic data to replicate results from the real data. Replicability has been defined using two criteria: (a) replicate the results of the analyses on real data, and (b) ensure valid population inferences from the synthetic data. A simulation study using three heterogeneous real-world datasets evaluated the replicability of logistic regression workloads. Eight replicability metrics were evaluated: decision agreement, estimate agreement, standardized difference, confidence interval overlap, bias, confidence interval coverage, statistical power, and precision (empirical SE). The analysis of synthetic data used a multiple imputation approach whereby up to 20 datasets were generated and the fitted logistic regression models were combined using combining rules for fully synthetic datasets. The effects of synthetic data amplification were evaluated, and two types of generative models were used: sequential synthesis using boosted decision trees and a generative adversarial network (GAN). Privacy risk was evaluated using a membership disclosure metric. For sequential synthesis, adjusted model parameters after combining at least ten synthetic datasets gave high decision and estimate agreement, low standardized difference, high confidence interval overlap, low bias, nominal confidence interval coverage, and power close to the nominal level. Amplification had only a marginal benefit. Confidence interval coverage from a single synthetic dataset without applying combining rules was erroneous, and statistical power, as expected, was artificially inflated when amplification was used. Sequential synthesis performed considerably better than the GAN across multiple datasets. Membership disclosure risk was low for all datasets and models. For replicable results, the statistical analysis of fully synthetic data should be based on at least ten generated datasets of the same size as the original, with the analysis results combined. Analysis results from synthetic data without applying combining rules can be misleading. Replicability results depend on the type of generative model used, with our study suggesting that sequential synthesis has good replicability characteristics for common health research workloads.
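The abstract does not spell out the combining rules it refers to. As a minimal sketch, assuming the standard rules for fully synthetic data (pooled estimate equal to the mean of the per-dataset estimates, total variance equal to (1 + 1/m) times the between-dataset variance minus the average within-dataset variance), pooling one logistic regression coefficient could look like the following. The function name, its arguments, and the fallback used when the variance estimate is not positive are illustrative assumptions, not the study's implementation.

```python
import math
from statistics import mean, variance


def combine_fully_synthetic(estimates, variances):
    """Pool one coefficient across m synthetic datasets.

    estimates: point estimates q_i, one per synthetic dataset
    variances: squared standard errors u_i, one per synthetic dataset
    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two estimates with matching variances")

    q_bar = mean(estimates)   # pooled point estimate
    b = variance(estimates)   # between-dataset variance (m - 1 denominator)
    u_bar = mean(variances)   # average within-dataset variance

    # Total variance for fully synthetic data. Note the minus sign, which
    # differs from the rule used for multiply imputed missing data.
    total = (1 + 1 / m) * b - u_bar

    # The estimate can be negative or zero when m is small. Falling back to
    # u_bar is an assumption made for this sketch.
    if total <= 0:
        total = u_bar

    return q_bar, math.sqrt(total)


if __name__ == "__main__":
    # Hypothetical coefficients and squared standard errors from ten datasets.
    coefs = [0.42, 0.71, 0.38, 0.95, 0.55, 0.27, 0.84, 0.60, 0.49, 0.77]
    ses2 = [0.010, 0.012, 0.011, 0.009, 0.010, 0.013, 0.011, 0.010, 0.012, 0.011]
    q, se = combine_fully_synthetic(coefs, ses2)
    print(f"pooled estimate {q:.3f}, 95% CI {q - 1.96 * se:.3f} to {q + 1.96 * se:.3f}")
```

The normal-approximation interval in the usage example is a simplification; a t reference distribution with estimated degrees of freedom is the more careful choice when few datasets are combined.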


1554 Background: There is strong interest among researchers, the pharmaceutical industry, medical journal editors, funders of research, and regulators in sharing clinical trial data. Reusing data extracts the most utility possible from patient contributions. The majority of patients do want to share their data for secondary research purposes. However, data access for secondary analysis remains a challenge. A key reason why individual-level data is not made directly available to data users by authors and data custodians is concern over breaches of patient privacy. Synthetic data generation (SDG) is an effective way to address privacy concerns and can enable the broader sharing of clinical trial datasets. However, a key question is whether the reproducibility of the generated data is adequate to draw reliable conclusions. Methods: We synthesized datasets from five pragmatic breast cancer clinical trials performed by the REaCT group (https://react.ohri.ca/). A sequential synthesis method, a type of machine learning, was used. The published analysis of each trial was repeated on each synthetic dataset to evaluate reproducibility. We evaluated reproducibility on three criteria: (a) decision agreement: the direction and statistical significance of the primary endpoint effect estimates are the same as in the real data, (b) estimate agreement: the parameter estimates from the synthetic data are within the 95% confidence interval of the real data, and (c) confidence interval overlap: the overlap between the real and synthetic confidence intervals is above 50%. In addition, we evaluated privacy using a membership disclosure metric. This evaluates the ability of an adversary to determine, using the synthetic data, that a target individual was in the original dataset, computed as an F1 classification score. Results: Decision and estimate agreement held across all five trials, and the confidence interval overlap was high. The risks of membership disclosure were all below the established 0.2 threshold. Conclusions: In this study, we successfully generated synthetic datasets that accurately replicated original data from five oncology trials and yielded the same results as the original published studies, with a very low risk of membership disclosure. With proper modeling techniques, synthetic datasets can play a key role in data democratization and the reuse of oncology clinical trials. [Table: see text]
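The three reproducibility criteria and the F1 score described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' code: the `Fit` class and all function names are hypothetical, the null value is assumed to be zero (as on a log-odds or mean-difference scale), the overlap is computed as the average of the shared length relative to each interval (one common definition, clipped at zero here), and the abstract does not say how the adversary's membership claims are produced or whether the score is adjusted before comparison with the 0.2 threshold.

```python
from dataclasses import dataclass


@dataclass
class Fit:
    """Point estimate and 95% confidence interval for one parameter."""
    estimate: float
    ci_low: float
    ci_high: float

    @property
    def significant(self) -> bool:
        # Significant when the interval excludes the assumed null value of 0.
        return self.ci_low > 0 or self.ci_high < 0


def decision_agreement(real: Fit, synth: Fit) -> bool:
    """Same direction of effect and same significance decision."""
    same_direction = (real.estimate > 0) == (synth.estimate > 0)
    return same_direction and real.significant == synth.significant


def estimate_agreement(real: Fit, synth: Fit) -> bool:
    """Synthetic estimate lies inside the real-data confidence interval."""
    return real.ci_low <= synth.estimate <= real.ci_high


def ci_overlap(real: Fit, synth: Fit) -> float:
    """Average share of each interval covered by their intersection."""
    shared = min(real.ci_high, synth.ci_high) - max(real.ci_low, synth.ci_low)
    shared = max(shared, 0.0)  # disjoint intervals count as zero overlap
    real_len = real.ci_high - real.ci_low
    synth_len = synth.ci_high - synth.ci_low
    return 0.5 * (shared / real_len + shared / synth_len)


def membership_f1(claimed: set, actual: set) -> float:
    """F1 score of an adversary's membership claims against true members."""
    tp = len(claimed & actual)   # correctly claimed members
    fp = len(claimed - actual)   # claimed but not in the original data
    fn = len(actual - claimed)   # members the adversary missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    real = Fit(estimate=0.5, ci_low=0.1, ci_high=0.9)    # hypothetical values
    synth = Fit(estimate=0.6, ci_low=0.2, ci_high=1.0)   # hypothetical values
    print(decision_agreement(real, synth), estimate_agreement(real, synth))
    print(ci_overlap(real, synth) > 0.5)
```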


Articles published on Membership Disclosure

Cluster-based anonymity model and algorithm for 1:1 dataset with a single sensitive attribute using machine learning technique

An evaluation of the replicability of analyses using synthetic health data

Evaluating the Utility and Privacy of Synthetic Breast Cancer Clinical Trial Data Sets.

A comparison of synthetic data generation and federated analysis for enabling international evaluations of cardiovascular health

Can synthetic data accurately mimic oncology clinical trials?

AN EVALUATION OF DATA ANONYMIZATION METHODS FOR DATA PUBLISHING

Validating a membership disclosure metric for synthetic health data.

Slicing-Based Enhanced Method for Privacy-Preserving in Publishing Big Data

Evaluating the utility of synthetic COVID-19 case data.

An efficient clustering-based anonymization scheme for privacy-preserving data collection in IoT based healthcare services

Secure Data Protection Using Slicing as a Confusion Technique

Managing dimensionality in data privacy anonymization

A REVIEW ON ANONYMIZATION TECHNIQUES FOR PRIVACY PRESERVING DATA PUBLISHING

Managing Privacy of Sensitive Attributes Using MFSARNN Clustering with Optimization Technique

Privacy Preserving Data Publishing through Slicing

Secure Access to High-dimensional Data through Slicing using Grouping Algorithm

A Novel Framework for Privacy Conserving Data Publishing and Handling High Dimensional Data

DATA SLICING TECHNIQUE TO PRIVACY PRESERVING AND DATA PUBLISHING
