Analyzing Medical Research Results Based on Synthetic Data and Their Relation to Real Data Results: Systematic Comparison From Five Observational Studies
Background: Privacy restrictions limit access to protected patient-derived health information for research purposes. Consequently, data anonymization is required to allow researchers data access for initial analysis before granting institutional review board approval. A system installed and activated at our institution enables synthetic data generation that mimics data from real electronic medical records, wherein only fictitious patients are listed. Objective: This paper aimed to validate the results obtained when analyzing synthetic structured data for medical research. A comprehensive validation process concerning meaningful clinical questions and various types of data was conducted to assess the accuracy and precision of statistical estimates derived from synthetic patient data. Methods: A cross-hospital project was conducted to validate results obtained from synthetic data produced for five contemporary studies on various topics. For each study, results derived from synthetic data were compared with those based on real data. In addition, repeatedly generated synthetic datasets were used to estimate the bias and stability of results obtained from synthetic data. Results: This study demonstrated that results derived from synthetic data were predictive of results from real data. When the number of patients was large relative to the number of variables used, highly accurate and strongly consistent results were observed between synthetic and real data. For studies based on smaller populations that accounted for confounders and modifiers by multivariate models, predictions were of moderate accuracy, yet clear trends were correctly observed. Conclusions: The use of synthetic structured data provides a close estimate to real data results and is thus a powerful tool in shaping research hypotheses and accessing estimated analyses, without risking patient privacy. 
Synthetic data enable broad access to data (eg, for out-of-organization researchers), and rapid, safe, and repeatable analysis of data in hospitals or other health organizations where patient privacy is a primary value.
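The abstract above mentions using repeatedly generated synthetic datasets to estimate the bias and stability of results. A minimal Python sketch of one way such a summary could be computed is shown here; the function name `bias_and_stability` and the choice of mean difference (bias) and sample standard deviation (stability) are illustrative assumptions, not the authors' stated procedure.

```python
import statistics

def bias_and_stability(real_estimate, synthetic_estimates):
    """Summarize how estimates from repeated synthetic datasets relate to the real one.

    Assumption (not from the source): bias is the mean synthetic estimate minus
    the real estimate, and stability is the sample standard deviation of the
    synthetic estimates across replicates.
    """
    mean_synth = statistics.mean(synthetic_estimates)
    return {
        "mean_synthetic": mean_synth,
        "bias": mean_synth - real_estimate,
        "stability_sd": statistics.stdev(synthetic_estimates),
    }

# Hypothetical numbers: one real-data estimate and three synthetic replicates.
summary = bias_and_stability(2.5, [1.0, 2.0, 3.0])
```

A small standard deviation across replicates suggests the synthetic-data result is stable, while a bias near zero suggests it is centered on the real-data result.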
- Research Article
1
- 10.69554/lqom5698
- Jun 1, 2024
- Journal of Data Protection & Privacy
This paper explores the potential applications of high-fidelity synthetic patient data in the context of healthcare research, including challenges and benefits. The paper starts by defining synthetic data, types of synthetic data and approaches to generating synthetic data. It then discusses the potential applications of synthetic data beyond its use as a privacy-enhancing technology, and current debates around whether synthetic data should be considered personal data and, therefore, should be subjected to privacy controls to minimise reidentification risks. This is followed by a discussion of privacy preservation approaches and privacy metrics that can be applied in the context of synthetic data. The paper includes a case study based on synthetic electronic healthcare record data from the Clinical Practice Research Datalink on how privacy concerns due to reidentification have been addressed in order to make this data available for research purposes. The authors conclude that synthetic data, particularly high-fidelity synthetic patient data, has the potential to add value over and above real data for public health and that it is possible to address privacy concerns to make synthetic data available via a combination of privacy measures applied during the synthetic data generation process and post-generation reidentification risk assessments as part of data protection impact assessments.
- Abstract
2
- 10.1182/blood-2022-168646
- Nov 15, 2022
- Blood
Synthetic Data Generation By Artificial Intelligence to Accelerate Translational Research and Precision Medicine in Hematological Malignancies
- Preprint Article
- 10.2196/preprints.71364
- Jan 16, 2025
BACKGROUND High-quality, large-scale healthcare research, especially research using medical records, encounters significant challenges related to technical difficulties and confidentiality issues. As a result, critical research questions about patient evaluation and treatment have been left unanswered. Moreover, the stigma and increased sensitivity surrounding mental health issues have significantly delayed research progress, particularly concerning Child and Adolescent Mental Health Services (CAMHS). OBJECTIVE These challenges can be effectively addressed by generating synthetic data, which not only safeguard individual privacy but also facilitate comprehensive analyses of clinical information from EMRs and other clinical data sources. To exemplify this method, we have utilized CAMHS synthetic data for planning the allocation of mental health resources, while ensuring confidentiality. In the process, using mental health clinical data, we demonstrate how to create and successfully analyse synthetic data from large-scale EMR-based data to answer critical health care questions for policymakers and clinicians. METHODS The study was carried out on a retrospectively collected cohort comprising 6,924 distinct patients from the Child and Adolescent Mental Health Services (CAMHS) in Stavanger, Norway. The analysis included 7,730 referral periods and a total of 58,524 episodes of care. The full dataset was divided into a training cohort (n = 6,184 referrals, 58,524 episodes of care) and an independent, fixed test set (n = 1,564 referrals, 14,610 episodes of care). A hierarchical synthetic data generation model was used to generate synthetic referral periods with the associated episodes of care based on “real-world” CAMHS data. In addition to the utility of the data, the quality and privacy risk of the generated synthetic data were assessed. 
RESULTS The synthetic hierarchical data generation model created reproducible synthetic CAMHS data with properties very similar to “real-world” data (for n = 656, KS/TVD Complement score = 0.92, CS score = 0.77, CS (Inter-table) score = 0.75 and CSS score = 0.92), while demonstrating a low risk score when exposed to a set of privacy attacks (for n = 656, average Singleout score (univariate) = 0.17, average Singleout score (multivariate) = 0.04, average Linkability risk = 2.5, average inference risk = 0.7). The predictive model trained on synthetic data produced performance comparable to that of the model trained on real data in classifying the intensity of care required by patients, while maintaining the interpretability of the features used (for n = 656, 1546, 3092 and 6184, average PR_AUC = 0.32, 0.33, 0.34 and 0.40, respectively, compared to PR_AUC = 0.43 when using n = 6184 real data records). CONCLUSIONS Synthetic data in Child and Adolescent Mental Health Services (CAMHS) balances data utility with fairness and privacy protection. It fosters trust between patients and healthcare providers while promoting collaboration among researchers by offering access to extensive and representative samples with a low risk of patient identification. This approach not only encourages data sharing but also expands the breadth of research while safeguarding patient privacy. Effective implementation of synthetic data generation methods in CAMHS depends on the model's ability to accurately identify and replicate the complex patterns present in real data, while maintaining consistency across various outputs. 
Therefore, selecting the appropriate technique is crucial for achieving accurate and insightful research findings in this field.
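The KS/TVD Complement scores reported above are similarity measures between a real and a synthetic column. As a rough illustration, the following Python sketch assumes the conventional definitions: one minus the two-sample Kolmogorov-Smirnov statistic for numeric columns, and one minus the total variation distance for categorical columns. The function names are hypothetical, and the abstract does not confirm that the authors computed the scores exactly this way.

```python
import bisect
from collections import Counter

def ks_complement(real, synth):
    """1 minus the two-sample KS statistic (largest gap between empirical CDFs)."""
    real_sorted, synth_sorted = sorted(real), sorted(synth)
    max_gap = 0.0
    # The largest CDF gap can only occur at an observed value.
    for x in sorted(set(real_sorted) | set(synth_sorted)):
        cdf_real = bisect.bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect.bisect_right(synth_sorted, x) / len(synth_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return 1.0 - max_gap

def tvd_complement(real, synth):
    """1 minus the total variation distance between category frequencies."""
    real_freq, synth_freq = Counter(real), Counter(synth)
    categories = set(real_freq) | set(synth_freq)
    tvd = 0.5 * sum(
        abs(real_freq[c] / len(real) - synth_freq[c] / len(synth))
        for c in categories
    )
    return 1.0 - tvd

# Toy columns (hypothetical data): identical numeric columns, shifted categorical ones.
numeric_score = ks_complement([1, 2, 3, 4], [1, 2, 3, 4])
categorical_score = tvd_complement(["a", "a", "b", "b"], ["a", "b", "b", "b"])
```

Both scores lie between 0 and 1, with 1 meaning the two marginal distributions are indistinguishable by that measure.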
- Research Article
- 10.21203/rs.3.rs-8497559/v1
- Jan 29, 2026
- Research Square
Background: Synthetic health data offers a promising means of sharing clinical information without compromising patient privacy. However, existing methods often produce outputs that differ in structure from real data and are evaluated in narrow contexts, limiting their practical use in downstream analytical workflows. This study introduces a pipeline that builds upon existing methods for generating realistic synthetic longitudinal electronic health record data, evaluates it across three diverse datasets, and offers evidence-based guidance on the use of synthetic data to replace or augment real data. Methods: The pipeline extends the existing state-of-the-art HALO and ConSequence frameworks with a post-processing step that reconstructs continuous variables and timestamps, producing synthetic data that closely matches the structure of real medical record datasets. It was applied to three clinically diverse datasets: a small longitudinal cohort, a medium-sized intensive-care dataset, and a very large multi-hospital administrative dataset. Realism was assessed alongside utility for machine learning, statistical modelling, and time series analysis tasks. Results: Across all datasets, the pipeline generated realistic synthetic data that preserved key statistical properties and relationships. Machine learning models trained on synthetic data achieved similar predictive accuracy and feature importance patterns to those trained on real data, indicating strong utility. Synthetic data also performed well in statistical modelling, with the direction and magnitude of effects generally closely aligned with the real data. However, it may be less suitable when precise estimates are required or when modelling relatively rare conditions. 
Importantly, although the pipeline reconstructed timestamp structures, it did not capture aggregate temporal patterns and the resulting data was therefore unsuitable for time series analysis. Conclusions: The pipeline produces realistic and analytically useful synthetic longitudinal electronic health record data across datasets of widely varying scales. These findings provide practical guidance on when synthetic data can meaningfully substitute for or complement real data.
- Research Article
1
- 10.1200/jco.2024.42.16_suppl.e13627
- Jun 1, 2024
- Journal of Clinical Oncology
e13627 Background: The analysis of genomic variants is crucial in precision oncology research, offering insights into cancer risks and progression, especially in diverse types such as lung adenocarcinoma (LUAD). However, such research often grapples with balancing patient privacy with the need for comprehensive, high-quality genomic datasets. Our project addresses this by creating synthetic clinical-genomic data, which maintains patient confidentiality and provides a rich resource for genomic cancer research. Methods: Leveraging the GuardantINFORM database, which includes anonymized genomic data and structured payer claims, we focused on generating synthetic data for LUAD patient cohorts. This approach involves processing real patient data into a format compatible with Medisyn’s generative AI models, ensuring the synthetic data retains the original's statistical properties, and processing the output back into the original database structure and format. This method plays a crucial role in maintaining patient privacy and serves as a valuable tool for research by enabling the generation of realistic patients with desired properties on demand. Results: Our synthetic data closely mirrors real-world genomic and claims variable distributions, evidenced by a 0.994 R2 correlation between real and synthetic data along with comparable Oncoprints. Importantly, privacy tests show that patient confidentiality is effectively maintained despite this effective performance. The synthetic data's utility was then demonstrated in a study replicating real-world findings: LUAD patients with KRAS G12C in combination with STK11 mutations showed a significantly higher risk of early mortality. This underscores the potential of synthetic data in advancing cancer research. Conclusions: This research offers a promising avenue for the cancer research community. 
By providing a method to share privatized, synthetic genomic data, which can be combined and generated on demand, we enable broader, more responsible data sharing. This approach protects patient privacy and offers a rich dataset for groundbreaking research, potentially accelerating advances in cancer diagnosis and treatment. [Table: see text]
- Research Article
4
- 10.1002/pds.70019
- Oct 1, 2024
- Pharmacoepidemiology and drug safety
To assess the validity of privacy-preserving synthetic data by comparing results from synthetic versus original EHR data analysis. A published retrospective cohort study on real-world effectiveness of COVID-19 vaccines by Maccabi Healthcare Services in Israel was replicated using synthetic data generated from the same source, and the results were compared between synthetic versus original datasets. The endpoints included COVID-19 infection, symptomatic COVID-19 infection and hospitalization due to infection and were also assessed in several demographic and clinical subgroups. In comparing synthetic versus original data estimates, several metrics were utilized: standardized mean differences (SMD), decision agreement, estimate agreement, confidence interval overlap, and Wald test. Synthetic data were generated five times to assess the stability of results. The distribution of demographic and clinical characteristics demonstrated very small differences (< 0.01 SMD). In the comparison of vaccine effectiveness assessed in relative risk reduction between synthetic versus original data, there was a 100% decision agreement, 100% estimate agreement, and a high level of confidence interval overlap (88.7%-99.7%) in all five replicates across all subgroups. Similar findings were achieved in the assessment of vaccine effectiveness against symptomatic COVID-19 infection. In the comparison of hazard ratios for COVID-19-related hospitalization and odds ratios for symptomatic COVID-19 infection, the Wald tests suggested no significant difference between respective effect estimates in all five replicates for all patient subgroups, but there were disagreements in estimate and decision metrics in some subgroups and replicates. Overall, comparison of synthetic versus original real-world data demonstrated good validity and reliability. Transparency on the process to generate high fidelity synthetic data and assurances of patient privacy are warranted.
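To make the comparison metrics named above more concrete, here is a minimal Python sketch. The definitions are commonly used ones and are assumptions here, since the abstract does not spell them out: SMD with a pooled standard deviation, confidence interval overlap as the average share of each interval covered by their intersection, decision agreement as both intervals giving the same conclusion about a null value, and estimate agreement as the synthetic point estimate falling inside the original interval. All function names and numbers are hypothetical.

```python
import math

def standardized_mean_difference(mean_a, sd_a, mean_b, sd_b):
    """SMD for a continuous variable, using the pooled standard deviation."""
    pooled_sd = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2.0)
    return (mean_a - mean_b) / pooled_sd if pooled_sd > 0 else 0.0

def ci_overlap(real_ci, synth_ci):
    """Average fraction of each interval covered by the intersection (0 if disjoint)."""
    (real_lo, real_hi), (synth_lo, synth_hi) = real_ci, synth_ci
    intersection = min(real_hi, synth_hi) - max(real_lo, synth_lo)
    if intersection <= 0:
        return 0.0
    return 0.5 * (intersection / (real_hi - real_lo)
                  + intersection / (synth_hi - synth_lo))

def decision_agreement(real_ci, synth_ci, null_value=0.0):
    """True when both intervals lead to the same conclusion about the null value."""
    def decision(ci):
        lo, hi = ci
        if lo > null_value:
            return 1    # significantly above the null
        if hi < null_value:
            return -1   # significantly below the null
        return 0        # interval contains the null
    return decision(real_ci) == decision(synth_ci)

def estimate_agreement(synth_estimate, real_ci):
    """True when the synthetic point estimate lies inside the original interval."""
    return real_ci[0] <= synth_estimate <= real_ci[1]

# Hypothetical intervals for a relative risk reduction (null value 0).
overlap = ci_overlap((0.0, 1.0), (0.5, 1.5))
same_decision = decision_agreement((0.80, 0.95), (0.85, 0.97))
inside = estimate_agreement(0.90, (0.80, 0.95))
```

For ratio measures such as hazard ratios or odds ratios, the null value would be 1 rather than 0.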
- Conference Article
5
- 10.1109/icmla.2018.00166
- Dec 1, 2018
Patient data are regarded as highly sensitive and protected information by federal, state and local policies that make it available only to those who have been given access to Protected Health Information (PHI). In many applications, access to PHI and real patient data can be replaced by generated, realistic synthetic data. While methods exist that can generate synthetic data, it is unclear how to evaluate synthetic data quality. The objective of this paper is to present an investigation of a new method for statistically testing the quality of synthetic patient data. The Weighted Itemsets Error (WIE) measure compares frequent itemsets in the synthetic data with expected itemsets in real data, thus allowing for evaluating co-occurrence of data items. The derived measure is tested in the context of synthetic data comprising medical diagnoses. The results demonstrate the effects of the parameters that control the WIE measure, and indicate that WIE is a simple yet powerful approach for evaluating synthetic datasets.
- Research Article
24
- 10.1200/cci.23.00116
- Sep 1, 2023
- JCO Clinical Cancer Informatics
PURPOSE: There is strong interest from patients, researchers, the pharmaceutical industry, medical journal editors, funders of research, and regulators in sharing clinical trial data for secondary analysis. However, data access remains a challenge because of concerns about patient privacy. It has been argued that synthetic data generation (SDG) is an effective way to address these privacy concerns. There is a dearth of evidence supporting this on oncology clinical trial data sets, and on the utility of privacy-preserving synthetic data. The objective of the proposed study is to validate the utility and privacy risks of synthetic clinical trial data sets across multiple SDG techniques. METHODS: We synthesized data sets from eight breast cancer clinical trial data sets using three types of generative models: sequential synthesis, conditional generative adversarial network, and variational autoencoder. Synthetic data utility was evaluated by replicating the published analyses on the synthetic data and assessing concordance of effect estimates and CIs between real and synthetic data. Privacy was evaluated by measuring attribution disclosure risk and membership disclosure risk. RESULTS: Utility was highest using the sequential synthesis method, where all results were replicable and the CI overlap was most similar or higher for seven of eight data sets. Both types of privacy risks were low across all three types of generative models. DISCUSSION: Synthetic data using sequential synthesis methods can act as a proxy for real clinical trial data sets, and simultaneously have low privacy risks. This type of generative model can be one way to enable broader sharing of clinical trial data.
- Abstract
1
- 10.1182/blood-2022-171057
- Nov 15, 2022
- Blood
Systematic Evaluation of Synthetic Panel Data Quality with an Application to Chronic Lymphocytic Leukemia
- Research Article
- 10.63278/jicrcr.vi.3646
- Jan 5, 2026
- Journal of International Crisis and Risk Communication Research
Healthcare Quality Engineering teams face a critical challenge in validating claims processing systems. HIPAA regulations and organizational security policies restrict access to production data containing Protected Health Information. Traditional data masking techniques reduce contextual accuracy. This results in incomplete testing coverage and missed defects. Synthetic test data generation offers a compliant and privacy-preserving solution for testing X12 EDI transactions. Properly engineered synthetic EDI data reflects real clinical and billing behavior without exposing patient identities. This article examines the role of synthetic test data in healthcare claims Quality Engineering. It explores the challenges addressed by synthetic data generation. It analyzes strategies for creating high-quality synthetic EDI datasets that maintain statistical accuracy and structural integrity. Implementation considerations for enterprise Quality Engineering pipelines receive detailed attention. Business outcomes demonstrate substantial improvements in test automation coverage and release velocity. PHI-related compliance risk diminishes significantly with synthetic data adoption. The article discusses future advancements, including generative AI applications and metadata-driven dataset assembly. Synthetic EDI test data represents a foundational capability for healthcare organizations navigating the balance between innovation and security.
- Research Article
- 10.1158/1538-7445.am2019-1641
- Jul 1, 2019
- Cancer Research
While machine learning (ML) has shown some promise in medical research, its actual impact has been limited relative to other application domains. One reason for this disparity is the lack of high-quality, patient-level data available to the broader ML research community. Such datasets are often not made available due to protections around patient privacy. To overcome these obstacles, high-quality, synthetic datasets could be leveraged to accelerate methodological developments in the application of ML to biomedical research. Clinical data in the form of electronic health records present a rich data source to be used for synthetic data generation. Such data can be high dimensional and predominantly categorical, which poses multiple challenges from a modeling perspective. In this paper, we evaluate four classes of synthetic data generation techniques, as well as several metrics for evaluating the quality of the synthetic data. While the results and discussions are broadly applicable to medical data, for demonstration purposes we generate synthetic datasets from the publicly available Surveillance Epidemiology and End Results (SEER) program. Specifically, our cohort consists of breast cancer cases diagnosed in the year of 2010, which includes over 26000 individual cases. Finally, we discuss the trade-offs of the different methods and metrics, providing guidance on considerations for the generation and usage of synthetic medical data. Citation Format: Andre R. Goncalves, Priyadip Ray, Braden Soper, Madhumita Myneni, Jennifer L. Stevens, Linda M. Coyle, Ana Paula Sales. Generation and evaluation of medical synthetic data [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2019; 2019 Mar 29-Apr 3; Atlanta, GA. Philadelphia (PA): AACR; Cancer Res 2019;79(13 Suppl):Abstract nr 1641.
- Research Article
66
- 10.1111/coin.12427
- Jan 3, 2021
- Computational Intelligence
Electronic healthcare record data have been used to study risk factors of disease, treatment effectiveness and safety, and to inform healthcare service planning. There has been increasing interest in utilizing these data for new purposes such as for machine learning to develop predictive algorithms to aid diagnostic and treatment decisions. Synthetic data could potentially be an alternative to real‐world data for these purposes as well as reveal any biases in the data used for algorithm development. This article discusses the key requirements of synthetic data for multiple purposes and proposes an approach to generate and evaluate synthetic data focused on, but not limited to, cross‐sectional healthcare data. To our knowledge, this is the first article to propose a framework to generate and evaluate synthetic healthcare data with the aim of simultaneously preserving the complexities of ground truth data in the synthetic data while also ensuring privacy. We include findings and new insights from synthetic datasets modeled on both the Indian liver patient dataset and UK primary care dataset to demonstrate the application of this framework under different scenarios.
- Research Article
19
- 10.3389/frai.2022.918813
- Sep 14, 2022
- Frontiers in Artificial Intelligence
In the past decade, there has been exponentially growing interest in the use of observational data collected as a part of routine healthcare practice to determine the effect of a treatment with causal inference models. Validation of these models, however, has been a challenge because the ground truth is unknown: only one treatment-outcome pair for each person can be observed. There have been multiple efforts to fill this void using synthetic data where the ground truth can be generated. However, to date, these datasets have been severely limited in their utility either by being modeled after small non-representative patient populations, being dissimilar to real target populations, or only providing known effects for two cohorts (treated vs. control). In this work, we produced a large-scale and realistic synthetic dataset that provides ground truth effects for over 10 hypertension treatments on blood pressure outcomes. The synthetic dataset was created by modeling a nationwide cohort of more than 580,000 hypertension patients, including each person's multi-year history of diagnoses, medications, and laboratory values. We designed a data generation process by combining an adapted ADS-GAN model for fictitious patient information generation and a neural network for treatment outcome generation. A Wasserstein distance of 0.35 demonstrates that our synthetic data follows a nearly identical joint distribution to the patient cohort used to generate the data. Patient privacy was a primary concern for this study; the ϵ-identifiability metric, which estimates the probability of actual patients being identified, is 0.008%, ensuring that our synthetic data cannot be used to identify any actual patients. To demonstrate its usage, we tested the bias in causal effect estimation of four well-established models using this dataset. The approach we used can be readily extended to other types of diseases in the clinical domain, and to datasets in other domains as well.
- Research Article
102
- 10.1136/bmjopen-2020-043497
- Apr 1, 2021
- BMJ Open
Objectives: There are increasing requirements to make research data, especially clinical trial data, more broadly available for secondary analyses. However, data availability remains a challenge due to complex privacy requirements. This...
- Research Article
3
- 10.1182/blood-2024-203356
- Nov 5, 2024
- Blood
Using Synthetic Data Produced By Artificial Intelligence (AI) to Generate Insights for Chimeric Antigen Receptor T-Cell (CAR T-Cell) Clinical Trials