A case study comparing anonymized and synthetic health insurance claims data for medication safety assessments.

Abstract

Synthetic data generation is increasingly proposed as an alternative to classical anonymization for sharing health data. We compared concrete applications of both approaches on a small, high-dimensional health claims dataset, assessing their impact on fidelity, reproducibility of study outcomes, and privacy risks. To reflect different sharing contexts, we considered a context-independent, higher-risk scenario with no assumptions about potential attacks, and a context-dependent, lower-risk scenario informed by threat modeling. Analyses on anonymized and synthetic data yielded results similar to those from the original study data, but came at the cost of higher uncertainty when estimating hazard ratios. As expected, higher data utility and fidelity were related to higher privacy risks. Our findings provide a reusable workflow and comparative insights into anonymization and synthetization. They show that both methods are valuable means of lowering privacy risks in data-sharing scenarios, but results should be verified on the original data whenever possible.
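
The "higher uncertainty" reported for hazard ratios can be read as a wider confidence interval around a similar point estimate. The following is a minimal sketch, assuming a Wald-type interval built on the log scale; the function name `hazard_ratio_ci` and all numbers are hypothetical illustrations, not values from the study.

```python
import math

def hazard_ratio_ci(log_hr, se, z=1.96):
    """Return (HR, lower, upper) for a Wald interval built on the log scale."""
    return (
        math.exp(log_hr),
        math.exp(log_hr - z * se),
        math.exp(log_hr + z * se),
    )

# Hypothetical inputs: same log hazard ratio, larger standard error on shared data.
original = hazard_ratio_ci(0.40, 0.10)
shared = hazard_ratio_ci(0.40, 0.20)

print("original data:", original)
print("shared data:  ", shared)
```

With the same point estimate, doubling the standard error widens the interval, which is the pattern the abstract describes qualitatively.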

Similar Papers
  • Research Article
  • Cited by 24
  • 10.1200/cci.23.00116
Evaluating the Utility and Privacy of Synthetic Breast Cancer Clinical Trial Data Sets
  • Sep 1, 2023
  • JCO Clinical Cancer Informatics
  • Samer El Kababji + 15 more

PURPOSE There is strong interest from patients, researchers, the pharmaceutical industry, medical journal editors, funders of research, and regulators in sharing clinical trial data for secondary analysis. However, data access remains a challenge because of concerns about patient privacy. It has been argued that synthetic data generation (SDG) is an effective way to address these privacy concerns. There is a dearth of evidence supporting this on oncology clinical trial data sets, and on the utility of privacy-preserving synthetic data. The objective of the proposed study is to validate the utility and privacy risks of synthetic clinical trial data sets across multiple SDG techniques. METHODS We synthesized data sets from eight breast cancer clinical trial data sets using three types of generative models: sequential synthesis, conditional generative adversarial network, and variational autoencoder. Synthetic data utility was evaluated by replicating the published analyses on the synthetic data and assessing concordance of effect estimates and CIs between real and synthetic data. Privacy was evaluated by measuring attribution disclosure risk and membership disclosure risk. RESULTS Utility was highest using the sequential synthesis method where all results were replicable and the CI overlap most similar or higher for seven of eight data sets. Both types of privacy risks were low across all three types of generative models. DISCUSSION Synthetic data using sequential synthesis methods can act as a proxy for real clinical trial data sets, and simultaneously have low privacy risks. This type of generative model can be one way to enable broader sharing of clinical trial data.

  • Preprint Article
  • 10.2196/preprints.71364
Synthetic Data in Child and Adolescent Mental Health Service Research: A Tool Whose Time has Come. (Preprint)
  • Jan 16, 2025
  • Mounir Haizoune

BACKGROUND High-quality, large-scale healthcare research, especially research using medical records, encounters significant challenges related to technical difficulties and confidentiality issues. As a result, critical research questions about patient evaluation and treatment have been left unanswered. Moreover, the presence of stigma and increased sensitivity surrounding mental health issues have resulted in a significant delay in research progress, particularly concerning Child and Adolescent Mental Health Services (CAMHS). OBJECTIVE These challenges can be effectively addressed by generating synthetic data, which not only safeguard individual privacy but also facilitate comprehensive analyses of clinical information from EMRs and other clinical data sources. To exemplify this method, we have utilized CAMHS synthetic data for planning the allocation of mental health resources, while ensuring confidentiality. In the process, using mental health clinical data, we demonstrate how to create and successfully analyse synthetic data from large-scale EMR-based data to answer critical health care questions for policymakers and clinicians. METHODS The study was carried out on a retrospectively collected cohort comprising 6,924 distinct patients from the Child and Adolescent Mental Health Services (CAMHS) in Stavanger, Norway. The analysis included 7,730 referral periods and a total of 58,524 episodes of care. The full dataset was divided into a training cohort (n = 6184 referrals, 58524 episodes of care) and an independent, fixed test set (n = 1564 referrals, 14,610 episodes of care). A hierarchical synthetic data generation model was used to generate synthetic referral periods with the associated episodes of care based on “real-world” CAMHS data. In addition to the utility of the data, the quality and privacy risk of the generated synthetic data were assessed.
RESULTS The synthetic hierarchical data generation model created reproducible synthetic CAMHS data with properties very similar to “real-world” data (KS/TVD Complement score = 0.92, CS score = 0.77, CS (Inter-table) score = 0.75 and CSS score = 0.92), while demonstrating low risk scores when exposed to a set of privacy attacks (average Singleout score (univariate) = 0.17, average Singleout score (multivariate) = 0.04, average Linkability risk = 2.5, average inference risk = 0.7). The predictive model trained on synthetic data produced comparable performance to the model trained on real data in the context of classifying the intensity of care required by patients, all while maintaining the interpretability of the utilized features (for n = 656, 1546, 3092 and 6184, average PR_AUC = 0.32, 0.33, 0.34 and 0.40 respectively, compared to PR_AUC = 0.43 when using n = 6184 real data records). CONCLUSIONS Synthetic data in Child and Adolescent Mental Health Services (CAMHS) balances data utility with fairness and privacy protection. It fosters trust between patients and healthcare providers while promoting collaboration among researchers by offering access to extensive and representative samples with a low risk of patient identification. This approach not only encourages data sharing but also expands the breadth of research while safeguarding patient privacy. Effective implementation of synthetic data generation methods in CAMHS depends on the model's ability to accurately identify and replicate the complex patterns present in real data, while maintaining consistency across various outputs.
Therefore, selecting the appropriate technique is crucial for achieving accurate and insightful research findings in this field. CLINICALTRIAL The synthetic hierarchical data generation model created reproducible synthetic CAMHS data with properties very similar to “real-world” data (for n = 656, KS/TVD Complement score = 0.92, CS score = 0.77, CS (Inter-table) score = 0.75 and CSS score = 0.92), while demonstrating low risk scores when exposed to a set of privacy attacks (for n = 656, average Singleout score (univariate) = 0.17, average Singleout score (multivariate) = 0.04, average Linkability risk = 2.5, average inference risk = 0.7). The predictive model trained on synthetic data produced comparable performance to the model trained on real data in the context of classifying the intensity of care required by patients, all while maintaining the interpretability of the utilized features (for n = 656, 1546, 3092 and 6184, average PR_AUC = 0.32, 0.33, 0.34 and 0.40 respectively, compared to PR_AUC = 0.43 when using n = 6184 real data records).

  • Research Article
  • Cited by 2
  • 10.3233/shti240490
On the Fidelity-Privacy Tradeoff of Synthetic Cancer Registry Data.
  • Aug 22, 2024
  • Studies in health technology and informatics
  • Philipp Röchner

The sharing of personal health data is highly regulated due to privacy and security concerns. An alternative to sharing personal data is to share synthetic data, because ideally it should be impossible to reconstruct real personal data from synthetic data, which is called privacy. At the same time, the structure of the synthetic data should be as similar as possible to the structure of the real data to ensure that conclusions drawn from the synthetic data are also valid for the real data, which is called fidelity. Typically, there is a tradeoff between fidelity and privacy for synthetic health data. We study the fidelity and privacy of cancer data synthesized using generative machine learning approaches. To generate synthetic cancer data, we use variational autoencoders (VAEs), generative adversarial networks (GANs), and denoising diffusion probabilistic models (DDPMs). The tabular cancer registry data studied have nine categorical variables from breast cancer patients. We find that DDPMs generate synthetic cancer data with higher fidelity; that is, the structure of the synthetic data is more similar to the real cancer data than the data generated by VAEs and GANs. At the same time, synthetic cancer data from DDPMs pose a greater privacy risk because the data are more likely to reveal information from real patients than synthetic data from VAEs and GANs.
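
Fidelity for categorical variables is often summarized by how closely the category frequencies in the synthetic data match those in the real data. The sketch below uses the complement of the total variation distance for a single column; this is an assumed, commonly used measure, not necessarily the metric used in the paper, and the function name and sample values are hypothetical.

```python
from collections import Counter

def tvd_complement(real, synthetic):
    """1 minus the total variation distance between two categorical samples.

    Returns 1.0 when the category frequencies are identical and 0.0 when
    the two samples share no categories.
    """
    real_counts, synth_counts = Counter(real), Counter(synthetic)
    categories = set(real_counts) | set(synth_counts)
    tvd = 0.5 * sum(
        abs(real_counts[c] / len(real) - synth_counts[c] / len(synthetic))
        for c in categories
    )
    return 1.0 - tvd

# Hypothetical values of one categorical variable in real and synthetic data.
real_column = ["I", "I", "II", "II"]
synthetic_column = ["I", "II", "II", "II"]
print(tvd_complement(real_column, synthetic_column))
```

A score close to 1 on every column is what "higher fidelity" would look like under this measure; it says nothing by itself about privacy risk, which has to be assessed separately.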

  • Research Article
  • Cited by 73
  • 10.56553/popets-2023-0055
A Unified Framework for Quantifying Privacy Risk in Synthetic Data
  • Apr 1, 2023
  • Proceedings on Privacy Enhancing Technologies
  • Matteo Giomi + 3 more

Synthetic data is often presented as a method for sharing sensitive information in a privacy-preserving manner by reproducing the global statistical properties of the original data without disclosing sensitive information about any individual. In practice, as with other anonymization methods, synthetic data cannot entirely eliminate privacy risks. These residual privacy risks need instead to be ex-post uncovered and assessed. However, quantifying the actual privacy risks of any synthetic dataset is a hard task, given the multitude of facets of data privacy. We present Anonymeter, a statistical framework to jointly quantify different types of privacy risks in synthetic tabular datasets. We equip this framework with attack-based evaluations for the singling out, linkability, and inference risks, which are the three key indicators of factual anonymization according to data protection regulations, such as the European General Data Protection Regulation (GDPR). To the best of our knowledge, we are the first to introduce a coherent and legally aligned evaluation of these three privacy risks for synthetic data, as well as to design privacy attacks which model directly the singling out and linkability risks. We demonstrate the effectiveness of our methods by conducting an extensive set of experiments that measure the privacy risks of data with deliberately inserted privacy leakages, and of synthetic data generated with and without differential privacy. Our results highlight that the three privacy risks reported by our framework scale linearly with the amount of privacy leakage in the data. Furthermore, we observe that synthetic data exhibits the lowest vulnerability against linkability, indicating one-to-one relationships between real and synthetic data records are not preserved.
Finally, with a quantitative comparison we demonstrate that Anonymeter outperforms existing synthetic data privacy evaluation frameworks both in terms of detecting privacy leaks, as well as computation speed. To contribute to a privacy-conscious usage of synthetic data, we publish Anonymeter as an open-source library (https://github.com/statice/anonymeter).

  • Abstract
  • 10.1182/blood-2024-209541
Generation of Multimodal Longitudinal Synthetic Data By Artificial Intelligence to Improve Personalized Medicine in Hematology
  • Nov 5, 2024
  • Blood
  • Saverio D'Amico + 41 more

  • Research Article
  • Cited by 4
  • 10.3171/2025.4.focus25225
Synthetic neurosurgical data generation with generative adversarial networks and large language models: an investigation on fidelity, utility, and privacy.
  • Jul 1, 2025
  • Neurosurgical focus
  • Austin A Barr + 3 more

Use of neurosurgical data for clinical research and machine learning (ML) model development is often limited by data availability, sample sizes, and regulatory constraints. Synthetic data offer a potential solution to challenges associated with accessing, sharing, and using real-world data (RWD). The aim of this study was to evaluate the capability of generating synthetic neurosurgical data with a generative adversarial network and large language model (LLM) to augment RWD, perform secondary analyses in place of RWD, and train an ML model to predict postoperative outcomes. Synthetic data were generated with a conditional tabular generative adversarial network (CTGAN) and the LLM GPT-4o based on a real-world neurosurgical dataset of 140 older adults who underwent neurosurgical interventions. Each model was used to generate datasets at equivalent (n = 140) and amplified (n = 1000) sample sizes. Data fidelity was evaluated by comparing univariate and bivariate statistics to the RWD. Privacy evaluation involved measuring the uniqueness of generated synthetic records. Utility was assessed by: 1) reproducing and extending clinical analyses on predictors of Karnofsky Performance Status (KPS) deterioration at discharge and a prolonged postoperative intensive care unit (ICU) stay, and 2) training a binary ML classifier on amplified synthetic datasets to predict KPS deterioration on RWD. Both the CTGAN and GPT-4o generated complete, high-fidelity synthetic tabular datasets. GPT-4o matched or exceeded CTGAN across all measured fidelity, utility, and privacy metrics. All significant clinical predictors of KPS deterioration and prolonged ICU stay were retained in the GPT-4o-generated synthetic data, with some differences observed in effect sizes. Preoperative KPS was not preserved as a significant predictor in the CTGAN-generated data. 
The ML classifier trained on GPT-4o data outperformed the model trained on CTGAN data, achieving a higher F1 score (0.725 vs 0.688) for predicting KPS deterioration. This study demonstrated a promising ability to produce high-fidelity synthetic neurosurgical data using generative models. Synthetic neurosurgical data present a potential solution to critical limitations in data availability for neurosurgical research. Further investigation is necessary to enhance synthetic data utility for secondary analyses and ML model training, and to evaluate synthetic data generation methods across other datasets, including clinical trial data.

  • Research Article
  • Cited by 138
  • 10.2196/16492
Analyzing Medical Research Results Based on Synthetic Data and Their Relation to Real Data Results: Systematic Comparison From Five Observational Studies
  • Feb 20, 2020
  • JMIR Medical Informatics
  • Anat Reiner Benaim + 11 more

Background Privacy restrictions limit access to protected patient-derived health information for research purposes. Consequently, data anonymization is required to allow researchers data access for initial analysis before granting institutional review board approval. A system installed and activated at our institution enables synthetic data generation that mimics data from real electronic medical records, wherein only fictitious patients are listed. Objective This paper aimed to validate the results obtained when analyzing synthetic structured data for medical research. A comprehensive validation process concerning meaningful clinical questions and various types of data was conducted to assess the accuracy and precision of statistical estimates derived from synthetic patient data. Methods A cross-hospital project was conducted to validate results obtained from synthetic data produced for five contemporary studies on various topics. For each study, results derived from synthetic data were compared with those based on real data. In addition, repeatedly generated synthetic datasets were used to estimate the bias and stability of results obtained from synthetic data. Results This study demonstrated that results derived from synthetic data were predictive of results from real data. When the number of patients was large relative to the number of variables used, highly accurate and strongly consistent results were observed between synthetic and real data. For studies based on smaller populations that accounted for confounders and modifiers by multivariate models, predictions were of moderate accuracy, yet clear trends were correctly observed. Conclusions The use of synthetic structured data provides a close estimate to real data results and is thus a powerful tool in shaping research hypotheses and accessing estimated analyses, without risking patient privacy.
Synthetic data enable broad access to data (eg, for out-of-organization researchers), and rapid, safe, and repeatable analysis of data in hospitals or other health organizations where patient privacy is a primary value.

  • Research Article
  • Cited by 23
  • 10.1038/s41598-024-57207-7
An evaluation of the replicability of analyses using synthetic health data
  • Mar 24, 2024
  • Scientific Reports
  • Khaled El Emam + 3 more

Synthetic data generation is being increasingly used as a privacy preserving approach for sharing health data. In addition to protecting privacy, it is important to ensure that generated data has high utility. A common way to assess utility is the ability of synthetic data to replicate results from the real data. Replicability has been defined using two criteria: (a) replicate the results of the analyses on real data, and (b) ensure valid population inferences from the synthetic data. A simulation study using three heterogeneous real-world datasets evaluated the replicability of logistic regression workloads. Eight replicability metrics were evaluated: decision agreement, estimate agreement, standardized difference, confidence interval overlap, bias, confidence interval coverage, statistical power, and precision (empirical SE). The analysis of synthetic data used a multiple imputation approach whereby up to 20 datasets were generated and the fitted logistic regression models were combined using combining rules for fully synthetic datasets. The effects of synthetic data amplification were evaluated, and two types of generative models were used: sequential synthesis using boosted decision trees and a generative adversarial network (GAN). Privacy risk was evaluated using a membership disclosure metric. For sequential synthesis, adjusted model parameters after combining at least ten synthetic datasets gave high decision and estimate agreement, low standardized difference, as well as high confidence interval overlap, low bias, the confidence interval had nominal coverage, and power close to the nominal level. Amplification had only a marginal benefit. Confidence interval coverage from a single synthetic dataset without applying combining rules were erroneous, and statistical power, as expected, was artificially inflated when amplification was used. Sequential synthesis performed considerably better than the GAN across multiple datasets. 
Membership disclosure risk was low for all datasets and models. For replicable results, the statistical analysis of fully synthetic data should be based on at least ten generated datasets of the same size as the original whose analyses results are combined. Analysis results from synthetic data without applying combining rules can be misleading. Replicability results are dependent on the type of generative model used, with our study suggesting that sequential synthesis has good replicability characteristics for common health research workloads.
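
Two of the ideas in this abstract, confidence interval overlap and combining results across several synthetic datasets, can be sketched in a few lines. The sketch assumes the usual averaged interval-overlap definition and the standard combining rules for fully synthetic data, in which the total variance is (1 + 1/m) b − ū for m datasets, between-dataset variance b, and mean within-dataset variance ū. The abstract does not spell out its exact formulas, so treat both function names and the numbers as hypothetical.

```python
from statistics import mean, variance

def ci_overlap(real_ci, synth_ci):
    """Average share of each interval covered by the intersection.

    Equals 1.0 for identical intervals and is negative for disjoint ones.
    """
    (real_lo, real_hi), (synth_lo, synth_hi) = real_ci, synth_ci
    shared = min(real_hi, synth_hi) - max(real_lo, synth_lo)
    return 0.5 * (shared / (real_hi - real_lo) + shared / (synth_hi - synth_lo))

def combine_fully_synthetic(estimates, variances):
    """Pool one coefficient across m synthetic datasets.

    Returns the pooled estimate and the total variance (1 + 1/m) * b - u_bar.
    The variance estimate can turn out negative with few datasets, which is
    one reason to generate several of them.
    """
    m = len(estimates)
    q_bar = mean(estimates)      # pooled point estimate
    b = variance(estimates)      # between-dataset sample variance
    u_bar = mean(variances)      # mean within-dataset variance
    return q_bar, (1 + 1 / m) * b - u_bar

# Hypothetical coefficient estimates and variances from three synthetic datasets.
pooled, total_var = combine_fully_synthetic([1.0, 2.0, 3.0], [0.1, 0.1, 0.1])
print(pooled, total_var)
print(ci_overlap((0.0, 2.0), (1.0, 3.0)))
```

Analysing a single synthetic dataset skips the between-dataset term entirely, which is consistent with the abstract's warning that intervals computed without combining rules can be misleading.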

  • Research Article
  • Cited by 176
  • 10.1016/j.cosrev.2023.100546
Synthetic data generation: State of the art in health care domain
  • Feb 26, 2023
  • Computer Science Review
  • Hajra Murtaza + 5 more

  • Research Article
  • 10.17269/s41997-026-01153-6
Synthetic health data in Canada: A scoping review of methods, applications, and data sources.
  • Feb 9, 2026
  • Canadian journal of public health = Revue canadienne de sante publique
  • Hassan Maleki Golandouz + 1 more

Access to provincial health-related data for multi-jurisdictional studies in Canada is restricted by privacy laws. Synthetic data (SD), which mimic real data, can facilitate privacy preservation. However, information on SD use in Canadian research is limited. Our objective was to review the characteristics, methods, and applications of published studies generating SD from Canadian health data (HD), including administrative, survey, public health, and clinical sources. We conducted a scoping review following Arksey and O'Malley, Joanna Briggs Institute, and Preferred Reporting Items for Systematic reviews and Meta-Analyses extension for Scoping Reviews guidelines on studies (to September 2024) generating SD from provincial/national HD. We included English-language peer-reviewed articles and grey literature, identified through PubMed, Scopus, Web of Science, Google, and references. We extracted and descriptively analyzed data on HD types, research purposes, geographic sources, synthesis methods, and quality evaluation. Of 232 identified articles, 31 were reviewed and nine met inclusion criteria; three additional articles were found through references and Google. Eleven articles were peer-reviewed. Topics included data replication, bias mitigation, and privacy-risk assessment. Survey data were most commonly synthesized. SD were generated from national/provincial datasets, including Canadian Community Health Survey and administrative/clinical data from Alberta, Manitoba, British Columbia, and Ontario. Synthesis methods included generative, sampling, and predictive models. Data quality evaluations assessed replicability, privacy risk, and predictive performance. SD have mainly been used in single-province studies and national surveys. Broader use in clinical and public HD with methodological consistency could strengthen its role for privacy-protecting, multi-jurisdictional research and surveillance initiatives.

  • Research Article
  • Cited by 22
  • 10.23889/ijpds.v8i1.2158
Federated learning for generating synthetic data: a scoping review.
  • Oct 31, 2023
  • International journal of population data science
  • Claire Little + 2 more

Federated Learning (FL) is a decentralised approach to training statistical models, where training is performed across multiple clients, producing one global model. Since the training data remains with each local client and is not shared or exchanged with other clients the use of FL may reduce privacy and security risks (compared to methods where multiple data sources are pooled) and can also address data access and heterogeneity problems. Synthetic data is artificially generated data that has the same structure and statistical properties as the original but that does not contain any of the original data records, therefore minimising disclosure risk. Using FL to produce synthetic data (which we refer to as "federated synthesis") has the potential to combine data from multiple clients without compromising privacy, allowing access to data that may otherwise be inaccessible in its raw format. The objective was to review current research and practices for using FL to generate synthetic data and determine the extent to which research has been undertaken, the methods and evaluation practices used, and any research gaps. A scoping review was conducted to systematically map and describe the published literature on the use of FL to generate synthetic data. Relevant studies were identified through online databases and the findings are described, grouped, and summarised. Information extracted included article characteristics, documenting the type of data that is synthesised, the model architecture and the methods (if any) used to evaluate utility and privacy risk. A total of 69 articles were included in the scoping review; all were published between 2018 and 2023 with two thirds (46) in 2022. 30% (21) were focussed on synthetic data generation as the main model output (with 6 of these generating tabular data), whereas 59% (41) focussed on data augmentation. 
Of the 21 performing federated synthesis, all used deep learning methods (predominantly Generative Adversarial Networks) to generate the synthetic data. Federated synthesis is in its early days but shows promise as a method that can construct a global synthetic dataset without sharing any of the local client data. As a field in its infancy there are areas to explore in terms of the privacy risk associated with the various methods proposed, and more generally in how we measure those risks.

  • Research Article
  • 10.1182/blood-2025-4350
Development and validation of synthetic data generation over a federated learning computing framework to accelerate innovation and boost personalized medicine in hematological diseases
  • Nov 3, 2025
  • Blood
  • Gianluca Asti + 37 more

  • Dissertation
  • 10.63227/652.299.60
Big data vs Big law: The impact of big data and machine learning in anonymising or synthesizing data for use across borders.
  • Jan 1, 2025
  • Kenneth Darker + 1 more

This research investigates the viability of anonymization and synthetic data generation in the area of big data so that the data could be shared across borders and exist outside the constraints of privacy laws. These privacy laws are growing around the world to help protect individual identity and prevent open sharing of private data. These privacy laws all provide guidance on how data may be shared and the strict conditions upon how that may occur. Two methods which are growing in popularity are anonymization of data, specifically k-Anonymity, l-Diversity and t-Closeness, and generating synthetic data from a real dataset leveraging machine learning techniques. This paper explores some of these techniques and aims to effectively measure them as a solution to allow organizations to share big data outside of the constraints of privacy laws. The areas of measurement addressed are risk, utility, and usability. A number of measurements are discussed within the paper and implemented within the artifact to allow for comparative testing of different datasets. The focus for this paper is on healthcare and financial data. For anonymization, it was important to understand the quasi-identifiers within the datasets and the sensitive attributes that needed to be considered. These details were used to conduct the measurements around risk and utility. Synthetic data needed to be measured to understand how similar it was to the real data and if any potential leaks of the real data occurred. Both were measured separately, but for usability were tested together across several machine learning models. Across both experiments in healthcare and finance, the results showed that anonymized data contained minimal utility while introducing risk, while real synthetic data performed well, retained utility and demonstrated very low risk. 
That said, the usability measure showed that synthetic data, while close, doesn’t perform exactly the same as the real data, which could be an issue depending on use case. In conclusion, the synthetic version of the anonymized data appears to be a viable option that could be shared with low risk, good utility and potentially good usability.
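
The k-Anonymity criterion mentioned in this abstract can be checked by grouping records on their quasi-identifiers and taking the size of the smallest group. The following is a minimal sketch with hypothetical column names and records; it is not the dissertation's artifact, and it does not cover l-Diversity or t-Closeness.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return k: the size of the smallest group of records that share
    the same values on every quasi-identifier."""
    groups = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return min(groups.values())

# Hypothetical records: "age" and "zip" are quasi-identifiers,
# "diagnosis" is the sensitive attribute.
raw = [
    {"age": "34", "zip": "12345", "diagnosis": "A"},
    {"age": "36", "zip": "12349", "diagnosis": "B"},
    {"age": "41", "zip": "12345", "diagnosis": "A"},
    {"age": "47", "zip": "12341", "diagnosis": "C"},
]
# The same records after generalizing age to ranges and truncating the zip code.
generalized = [
    {"age": "30-39", "zip": "1234*", "diagnosis": "A"},
    {"age": "30-39", "zip": "1234*", "diagnosis": "B"},
    {"age": "40-49", "zip": "1234*", "diagnosis": "A"},
    {"age": "40-49", "zip": "1234*", "diagnosis": "C"},
]

print(k_anonymity(raw, ["age", "zip"]))
print(k_anonymity(generalized, ["age", "zip"]))
```

Generalizing the quasi-identifiers raises k but coarsens the data, which is the risk-versus-utility tension the abstract measures.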

  • Research Article
  • Cited by 11
  • 10.3389/fdgth.2025.1576290
Comprehensive evaluation framework for synthetic tabular data in health: fidelity, utility and privacy analysis of generative models with and without privacy guarantees
  • Apr 24, 2025
  • Frontiers in Digital Health
  • Mikel Hernandez + 5 more

The generation of synthetic tabular data has emerged as a key privacy-enhancing technology to address challenges in data sharing, particularly in healthcare, where sensitive attributes can compromise patient privacy. Despite significant progress, balancing fidelity, utility, and privacy in complex medical datasets remains a substantial challenge. This paper introduces a comprehensive and holistic evaluation framework for synthetic tabular data, consolidating metrics and privacy risk measures across three key categories (fidelity, utility and privacy) and incorporating a fidelity-utility tradeoff metric. The framework was applied to three open-source medical datasets to evaluate synthetic tabular data generated by five generative models, both with and without differential privacy. Results showed that simpler models generally achieved better fidelity and utility, while more complex models provided lower privacy risks. The addition of differential privacy enhanced privacy preservation but often reduced fidelity and utility, highlighting the complexity of balancing fidelity, utility and privacy in synthetic data generation for medical datasets. Despite its contributions, this study acknowledges limitations, such as the lack of evaluation metrics or privacy risk measures for required model training time and resource usage, reliance on default model parameters, and the assessment of models that incorporate differential privacy with only a single privacy budget. Future work should explore parameter optimization, alternative privacy mechanisms, broader applications of the framework to diverse datasets and domains, and collaborations with clinicians for clinical utility evaluation. This study provides a foundation for improving synthetic tabular data evaluation and advancing privacy-preserving data sharing in healthcare.

  • Research Article
  • Cited by 1
  • 10.1016/j.jbi.2025.104939
A comprehensive evaluation framework for synthetic medical tabular data generation.
  • Nov 1, 2025
  • Journal of biomedical informatics
  • Anastasia Kurakova + 1 more
