High-fidelity synthetic patient data applications and privacy considerations

Abstract

This paper explores the potential applications of high-fidelity synthetic patient data in healthcare research, including the challenges and benefits. It starts by defining synthetic data, the types of synthetic data, and approaches to generating synthetic data. It then discusses potential applications of synthetic data beyond its use as a privacy-enhancing technology, and current debates around whether synthetic data should be considered personal data and, therefore, be subject to privacy controls to minimise reidentification risks. This is followed by a discussion of privacy preservation approaches and privacy metrics that can be applied to synthetic data. The paper includes a case study, based on synthetic electronic healthcare record data from the Clinical Practice Research Datalink, of how privacy concerns due to reidentification have been addressed in order to make this data available for research purposes. The authors conclude that synthetic data, particularly high-fidelity synthetic patient data, has the potential to add value over and above real data for public health, and that privacy concerns can be addressed, so that synthetic data can be made available, through a combination of privacy measures applied during the synthetic data generation process and post-generation reidentification risk assessments as part of data protection impact assessments.

Similar Papers
  • Abstract
  • 10.23889/ijpds.v7i3.1984
Barriers and facilitators to generating synthetic administrative data for research.
  • Aug 25, 2022
  • International Journal of Population Data Science
  • Theodora Kokosi + 4 more

Objectives: Generation of synthetic data could improve the efficiency of administrative data analysis. We describe barriers and facilitators to synthetic administrative data in the UK based on our experience of generating, assessing, and evaluating the performance of different approaches. We aim to provide guidance on the appropriate uses of synthetic administrative data. Approach: We generated synthetic versions of one large-population survey (Natsal-3) and two administrative datasets (Hospital Episode Statistics [HES] and National Pupil Database [NPD]). A range of methods were used based on the statistical techniques of sampling and prediction. We implemented non-parametric (e.g., Classification and Regression Tree) and parametric (e.g., generalised linear models) methods, and multiple imputation and Bayesian networks in R software. We attempted to generate low- and high-fidelity datasets and assessed utility by visualising marginal distributions of key variables, estimating the standardised propensity mean square error, and deriving standardised coefficient differences of model estimates and overlap of confidence intervals. Results: Results from our analysis highlighted some facilitators related to low-fidelity synthetic data that are quicker to generate, can retain the data types, format, and privacy and could be used to support training and code development. Conversely, some of the barriers included computational issues when generating high-fidelity synthetic data from complex data structures. High-fidelity data are achievable but only in the context of a specific research question and a limited number of variables. Results from the Natsal-3 data showed that parametric methods produced slightly better data utility compared to non-parametric methods. Results for HES and NPD will also be presented.
Conclusions: Low-fidelity synthetic data can provide a useful resource to support users of administrative data, whilst minimising data access timelines and while retaining privacy and confidentiality of personal data. High-utility datasets can be generated but take considerable resources, and current approaches cannot fully handle the complexity of longitudinal administrative data.
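
The confidence-interval overlap and standardised coefficient difference mentioned in this abstract can be illustrated with a short sketch. It assumes the commonly used definitions (overlap as the average share of each interval covered by their intersection; coefficient difference scaled by the real-data standard error). The study itself used R, and the function names and example numbers below are hypothetical, not taken from the paper.

```python
def ci_overlap(real_ci, synth_ci):
    """Average proportion of each interval covered by their intersection.

    Returns 1.0 for identical intervals and a negative value when the
    intervals are disjoint. Both intervals must have non-zero width.
    """
    (lo_r, hi_r), (lo_s, hi_s) = real_ci, synth_ci
    intersection = min(hi_r, hi_s) - max(lo_r, lo_s)  # negative if disjoint
    return 0.5 * (intersection / (hi_r - lo_r) + intersection / (hi_s - lo_s))


def standardised_coef_difference(real_coef, synth_coef, real_se):
    """Difference between synthetic and real estimates, in real-data SE units."""
    return (synth_coef - real_coef) / real_se


# Hypothetical model estimates, for illustration only.
print(ci_overlap((0.0, 2.0), (1.0, 3.0)))
print(standardised_coef_difference(0.40, 0.50, 0.05))
```

Values of the overlap close to 1 suggest that an analysis on the synthetic data would lead to much the same interval as on the real data; how close is "close enough" is a judgment the abstract does not specify.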

  • Research Article
  • Cite count: 4
  • 10.3171/2025.4.focus25225
Synthetic neurosurgical data generation with generative adversarial networks and large language models: an investigation on fidelity, utility, and privacy.
  • Jul 1, 2025
  • Neurosurgical focus
  • Austin A Barr + 3 more

Use of neurosurgical data for clinical research and machine learning (ML) model development is often limited by data availability, sample sizes, and regulatory constraints. Synthetic data offer a potential solution to challenges associated with accessing, sharing, and using real-world data (RWD). The aim of this study was to evaluate the capability of generating synthetic neurosurgical data with a generative adversarial network and large language model (LLM) to augment RWD, perform secondary analyses in place of RWD, and train an ML model to predict postoperative outcomes. Synthetic data were generated with a conditional tabular generative adversarial network (CTGAN) and the LLM GPT-4o based on a real-world neurosurgical dataset of 140 older adults who underwent neurosurgical interventions. Each model was used to generate datasets at equivalent (n = 140) and amplified (n = 1000) sample sizes. Data fidelity was evaluated by comparing univariate and bivariate statistics to the RWD. Privacy evaluation involved measuring the uniqueness of generated synthetic records. Utility was assessed by: 1) reproducing and extending clinical analyses on predictors of Karnofsky Performance Status (KPS) deterioration at discharge and a prolonged postoperative intensive care unit (ICU) stay, and 2) training a binary ML classifier on amplified synthetic datasets to predict KPS deterioration on RWD. Both the CTGAN and GPT-4o generated complete, high-fidelity synthetic tabular datasets. GPT-4o matched or exceeded CTGAN across all measured fidelity, utility, and privacy metrics. All significant clinical predictors of KPS deterioration and prolonged ICU stay were retained in the GPT-4o-generated synthetic data, with some differences observed in effect sizes. Preoperative KPS was not preserved as a significant predictor in the CTGAN-generated data. 
The ML classifier trained on GPT-4o data outperformed the model trained on CTGAN data, achieving a higher F1 score (0.725 vs 0.688) for predicting KPS deterioration. This study demonstrated a promising ability to produce high-fidelity synthetic neurosurgical data using generative models. Synthetic neurosurgical data present a potential solution to critical limitations in data availability for neurosurgical research. Further investigation is necessary to enhance synthetic data utility for secondary analyses and ML model training, and to evaluate synthetic data generation methods across other datasets, including clinical trial data.
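
The abstract states that privacy was evaluated by measuring the uniqueness of generated synthetic records but does not define the metric. One simple reading, shown here as an assumption rather than the authors' method, is the share of synthetic rows that exactly reproduce a real row. The function name and the toy rows are hypothetical.

```python
def exact_match_rate(real_rows, synth_rows):
    """Fraction of synthetic rows that are identical to some real row."""
    real_set = {tuple(row) for row in real_rows}
    matches = sum(tuple(row) in real_set for row in synth_rows)
    return matches / len(synth_rows)


# Toy records (age, sex, score); invented values, not data from the study.
real = [(70, "M", 80), (65, "F", 90)]
synth = [(70, "M", 80), (71, "M", 80), (66, "F", 70), (65, "F", 90)]
print(exact_match_rate(real, synth))  # 2 of 4 synthetic rows copy a real row
```

A lower rate means fewer verbatim copies of real patients; it says nothing about near-copies, which need a distance-based check.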

  • Research Article
  • Cite count: 2
  • 10.3233/shti240490
On the Fidelity-Privacy Tradeoff of Synthetic Cancer Registry Data.
  • Aug 22, 2024
  • Studies in health technology and informatics
  • Philipp Röchner

The sharing of personal health data is highly regulated due to privacy and security concerns. An alternative to sharing personal data is to share synthetic data, because ideally it should be impossible to reconstruct real personal data from synthetic data, which is called privacy. At the same time, the structure of the synthetic data should be as similar as possible to the structure of the real data to ensure that conclusions drawn from the synthetic data are also valid for the real data, which is called fidelity. Typically, there is a tradeoff between fidelity and privacy for synthetic health data. We study the fidelity and privacy of cancer data synthesized using generative machine learning approaches. To generate synthetic cancer data, we use variational autoencoders (VAEs), generative adversarial networks (GANs), and denoising diffusion probabilistic models (DDPMs). The tabular cancer registry data studied have nine categorical variables from breast cancer patients. We find that DDPMs generate synthetic cancer data with higher fidelity; that is, the structure of the synthetic data is more similar to the real cancer data than the data generated by VAEs and GANs. At the same time, synthetic cancer data from DDPMs pose a greater privacy risk because the data are more likely to reveal information from real patients than synthetic data from VAEs and GANs.

  • Research Article
  • Cite count: 68
  • 10.1038/s41746-023-00771-5
Patient-centric synthetic data generation, no reason to risk re-identification in biomedical data analysis
  • Mar 10, 2023
  • NPJ Digital Medicine
  • Morgan Guillaudeux + 10 more

While nearly all computational methods operate on pseudonymized personal data, re-identification remains a risk. With personal health data, this re-identification risk may be considered a double-crossing of patients’ trust. Herein, we present a new method to generate synthetic data of individual granularity while preserving patients’ privacy. Developed for sensitive biomedical data, the method is patient-centric as it uses a local model to generate random new synthetic data, called “avatar data”, for each initial sensitive individual. This method, compared with 2 other synthetic data generation techniques (Synthpop, CT-GAN), is applied to real health data from a clinical trial and a cancer observational study to evaluate the protection it provides while retaining the original statistical information. Compared to Synthpop and CT-GAN, the Avatar method shows a similar level of signal maintenance while allowing additional privacy metrics to be computed. In the light of distance-based privacy metrics, each individual produces an avatar simulation that is on average indistinguishable from 12 other generated avatar simulations for the clinical trial and 24 for the observational study. Data transformation using the Avatar method preserves both the evaluation of the treatment’s effectiveness, with similar hazard ratios for the clinical trial (original HR = 0.49 [95% CI, 0.39–0.63] vs. avatar HR = 0.40 [95% CI, 0.31–0.52]), and the classification properties for the observational study (original AUC = 99.46 (s.e. 0.25) vs. avatar AUC = 99.84 (s.e. 0.12)). Once validated by privacy metrics, anonymous synthetic data enable the creation of value from sensitive pseudonymized data analyses by tackling the risk of a privacy breach.

  • Preprint Article
  • 10.2196/preprints.71364
Synthetic Data in Child and Adolescent Mental Health Service Research: A Tool Whose Time has Come. (Preprint)
  • Jan 16, 2025
  • Mounir Haizoune

BACKGROUND High-quality, large-scale healthcare research, especially research using medical records, encounters significant challenges related to technical difficulties and confidentiality issues. As a result, critical research questions about patient evaluation and treatment have been left unanswered. Moreover, the presence of stigma and increased sensitivity surrounding mental health issues have resulted in a significant delay in research progress, particularly concerning Child and Adolescent Mental Health Services (CAMHS). OBJECTIVE These challenges can be effectively addressed by generating synthetic data, which not only safeguard individual privacy but also facilitate comprehensive analyses of clinical information from EMRs and other clinical data sources. To exemplify this method, we have utilized CAMHS synthetic data for planning the allocation of mental health resources, while ensuring confidentiality. In the process, using mental health clinical data, we demonstrate how to create and successfully analyse synthetic data from large-scale EMR-based data to answer critical health care questions for policymakers and clinicians. METHODS The study was carried out on a retrospectively collected cohort comprising 6,924 distinct patients from the Child and Adolescent Mental Health Services (CAMHS) in Stavanger, Norway. The analysis included 7,730 referral periods and a total of 58,524 episodes of care. The full dataset was divided into a training cohort (n = 6184 referrals, 58524 episodes of care) and an independent, fixed test set (n = 1564 referrals, 14,610 episodes of care). A hierarchical synthetic data generation model was used to generate synthetic referral periods with the associated episodes of care based on “real-world” CAMHS data. In addition to the utility of the data, the quality and privacy risk of the generated synthetic data were assessed.
RESULTS The synthetic hierarchical data generation model created reproducible synthetic CAMHS data with properties very similar to “real-world” data (KS/TVD Complement score = 0.92, CS score = 0.77, CS (Inter-table) score = 0.75 and CSS score = 0.92), while demonstrating low risk scores when exposed to a set of privacy attacks (average Singleout score (univariate) = 0.17, average Singleout score (multivariate) = 0.04, average Linkability risk = 2.5, average inference risk = 0.7). The predictive model trained on synthetic data produced comparable performance to the model trained on real data in the context of classifying the intensity of care required by patients, all while maintaining the interpretability of the utilized features (for n = 656, 1546, 3092 and 6184, average PR_AUC = 0.32, 0.33, 0.34 and 0.40 respectively, compared to PR_AUC = 0.43 when using n = 6184 real data records). CONCLUSIONS Synthetic data in Child and Adolescent Mental Health Services (CAMHS) balances data utility with fairness and privacy protection. It fosters trust between patients and healthcare providers while promoting collaboration among researchers by offering access to extensive and representative samples with a low risk of patient identification. This approach not only encourages data sharing but also expands the breadth of research while safeguarding patient privacy. Effective implementation of synthetic data generation methods in CAMHS depends on the model's ability to accurately identify and replicate the complex patterns present in real data, while maintaining consistency across various outputs.
Therefore, selecting the appropriate technique is crucial for achieving accurate and insightful research findings in this field. CLINICALTRIAL The synthetic hierarchical data generation model created reproducible synthetic CAMHS data with properties very similar to “real-world” data (for n = 656, KS/TVD Complement score = 0.92, CS score = 0.77, CS (Inter-table) score = 0.75 and CSS score = 0.92), while demonstrating low risk scores when exposed to a set of privacy attacks (for n = 656, average Singleout score (univariate) = 0.17, average Singleout score (multivariate) = 0.04, average Linkability risk = 2.5, average inference risk = 0.7). The predictive model trained on synthetic data produced comparable performance to the model trained on real data in the context of classifying the intensity of care required by patients, all while maintaining the interpretability of the utilized features (for n = 656, 1546, 3092 and 6184, average PR_AUC = 0.32, 0.33, 0.34 and 0.40 respectively, compared to PR_AUC = 0.43 when using n = 6184 real data records).
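
For readers unfamiliar with the "TVD Complement" score reported in this abstract, the sketch below shows the usual definition for a single categorical column: one minus the total variation distance between the real and synthetic category frequencies. The abstract does not say which implementation was used, so the function name and the toy values are assumptions for illustration only.

```python
from collections import Counter


def tv_complement(real_values, synth_values):
    """1 - total variation distance between two categorical samples.

    1.0 means identical category frequencies; 0.0 means no shared mass.
    """
    real_freq = Counter(real_values)
    synth_freq = Counter(synth_values)
    categories = set(real_freq) | set(synth_freq)
    tvd = 0.5 * sum(
        abs(real_freq[c] / len(real_values) - synth_freq[c] / len(synth_values))
        for c in categories
    )
    return 1.0 - tvd


# Toy column with two categories; values are invented.
print(tv_complement(["a", "a", "b", "b"], ["a", "b", "b", "b"]))
```

A dataset-level score such as the one quoted would typically average per-column scores, with a Kolmogorov-Smirnov based analogue used for continuous columns.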

  • Abstract
  • Cite count: 2
  • 10.1182/blood-2022-168646
Synthetic Data Generation By Artificial Intelligence to Accelerate Translational Research and Precision Medicine in Hematological Malignancies
  • Nov 15, 2022
  • Blood
  • Saverio D'Amico + 19 more


  • Research Article
  • 10.1182/blood-2025-4350
Development and validation of synthetic data generation over a federated learning computing framework to accelerate innovation and boost personalized medicine in hematological diseases
  • Nov 3, 2025
  • Blood
  • Gianluca Asti + 37 more


  • Research Article
  • Cite count: 210
  • 10.1038/s41746-020-00353-9
Generating high-fidelity synthetic patient data for assessing machine learning healthcare software
  • Nov 9, 2020
  • npj Digital Medicine
  • Allan Tucker + 3 more

There is a growing demand for the uptake of modern artificial intelligence technologies within healthcare systems. Many of these technologies exploit historical patient health data to build powerful predictive models that can be used to improve diagnosis and understanding of disease. However, there are many issues concerning patient privacy that need to be accounted for in order to enable this data to be better harnessed by all sectors. One approach that could offer a method of circumventing privacy issues is the creation of realistic synthetic data sets that capture as many of the complexities of the original data set (distributions, non-linear relationships, and noise) but that does not actually include any real patient data. While previous research has explored models for generating synthetic data sets, here we explore the integration of resampling, probabilistic graphical modelling, latent variable identification, and outlier analysis for producing realistic synthetic data based on UK primary care patient data. In particular, we focus on handling missingness, complex interactions between variables, and the resulting sensitivity analysis statistics from machine learning classifiers, while quantifying the risks of patient re-identification from synthetic datapoints. We show that, through our approach of integrating outlier analysis with graphical modelling and resampling, we can achieve synthetic data sets that are not significantly different from original ground truth data in terms of feature distributions, feature dependencies, and sensitivity analysis statistics when inferring machine learning classifiers. What is more, the risk of generating synthetic data that is identical or very similar to real patients is shown to be low.
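
The risk of a synthetic record being "identical or very similar" to a real patient is often quantified with a nearest-record distance. The sketch below is one such check, assuming numeric features that have already been scaled; the Euclidean metric, the threshold, and the function names are illustrative assumptions and not the method used in the paper.

```python
import math


def distance_to_closest_record(synth_row, real_rows):
    """Euclidean distance from one synthetic row to its nearest real row."""
    return min(math.dist(synth_row, real_row) for real_row in real_rows)


def share_too_close(real_rows, synth_rows, threshold):
    """Fraction of synthetic rows within `threshold` of some real row."""
    close = sum(
        distance_to_closest_record(row, real_rows) <= threshold
        for row in synth_rows
    )
    return close / len(synth_rows)


# Toy, pre-scaled feature vectors; invented for illustration.
real = [(0.0, 0.0), (1.0, 1.0)]
synth = [(0.0, 0.0), (0.5, 0.5), (3.0, 4.0)]
print(share_too_close(real, synth, threshold=0.1))
```

In practice the threshold is usually calibrated against the distances observed between real records themselves, for example from a held-out set.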

  • Research Article
  • Cite count: 138
  • 10.2196/16492
Analyzing Medical Research Results Based on Synthetic Data and Their Relation to Real Data Results: Systematic Comparison From Five Observational Studies
  • Feb 20, 2020
  • JMIR Medical Informatics
  • Anat Reiner Benaim + 11 more

Background: Privacy restrictions limit access to protected patient-derived health information for research purposes. Consequently, data anonymization is required to allow researchers data access for initial analysis before granting institutional review board approval. A system installed and activated at our institution enables synthetic data generation that mimics data from real electronic medical records, wherein only fictitious patients are listed. Objective: This paper aimed to validate the results obtained when analyzing synthetic structured data for medical research. A comprehensive validation process concerning meaningful clinical questions and various types of data was conducted to assess the accuracy and precision of statistical estimates derived from synthetic patient data. Methods: A cross-hospital project was conducted to validate results obtained from synthetic data produced for five contemporary studies on various topics. For each study, results derived from synthetic data were compared with those based on real data. In addition, repeatedly generated synthetic datasets were used to estimate the bias and stability of results obtained from synthetic data. Results: This study demonstrated that results derived from synthetic data were predictive of results from real data. When the number of patients was large relative to the number of variables used, highly accurate and strongly consistent results were observed between synthetic and real data. For studies based on smaller populations that accounted for confounders and modifiers by multivariate models, predictions were of moderate accuracy, yet clear trends were correctly observed. Conclusions: The use of synthetic structured data provides a close estimate to real data results and is thus a powerful tool in shaping research hypotheses and accessing estimated analyses, without risking patient privacy.
Synthetic data enable broad access to data (eg, for out-of-organization researchers), and rapid, safe, and repeatable analysis of data in hospitals or other health organizations where patient privacy is a primary value.

  • Research Article
  • Cite count: 2
  • 10.2196/53241
Creating High-Quality Synthetic Health Data: Framework for Model Development and Validation.
  • Apr 22, 2024
  • JMIR Formative Research
  • Elnaz Karimian Sichani + 3 more

Electronic health records are a valuable source of patient information that must be properly deidentified before being shared with researchers. This process requires expertise and time. In addition, synthetic data have considerably reduced the restrictions on the use and sharing of real data, allowing researchers to access it more rapidly with far fewer privacy constraints. Therefore, there has been a growing interest in establishing a method to generate synthetic data that protects patients' privacy while properly reflecting the data. This study aims to develop and validate a model that generates valuable synthetic longitudinal health data while protecting the privacy of the patients whose data are collected. We investigated the best model for generating synthetic health data, with a focus on longitudinal observations. We developed a generative model that relies on the generalized canonical polyadic (GCP) tensor decomposition. This model also involves sampling from a latent factor matrix of GCP decomposition, which contains patient factors, using sequential decision trees, copula, and Hamiltonian Monte Carlo methods. We applied the proposed model to samples from the MIMIC-III (version 1.4) data set. Numerous analyses and experiments were conducted with different data structures and scenarios. We assessed the similarity between our synthetic data and the real data by conducting utility assessments. These assessments evaluate the structure and general patterns present in the data, such as dependency structure, descriptive statistics, and marginal distributions. Regarding privacy disclosure, our model preserves privacy by preventing the direct sharing of patient information and eliminating the one-to-one link between the observed and model tensor records. This was achieved by simulating and modeling a latent factor matrix of GCP decomposition associated with patients. 
The findings show that our model is a promising method for generating synthetic longitudinal health data that is similar enough to real data. It can preserve the utility and privacy of the original data while also handling various data structures and scenarios. In certain experiments, all simulation methods used in the model produced the same high level of performance. Our model is also capable of addressing the challenge of sampling patients from electronic health records. This means that we can simulate a variety of patients in the synthetic data set, which may differ in number from the patients in the original data. We have presented a generative model for producing synthetic longitudinal health data. The model is formulated by applying the GCP tensor decomposition. We have provided 3 approaches for the synthesis and simulation of a latent factor matrix following the process of factorization. In brief, we have reduced the challenge of synthesizing massive longitudinal health data to synthesizing a nonlongitudinal and significantly smaller data set.

  • Research Article
  • Cite count: 4
  • 10.1002/pds.70019
Validation Assessment of Privacy-Preserving Synthetic Electronic Health Record Data: Comparison of Original Versus Synthetic Data on Real-World COVID-19 Vaccine Effectiveness.
  • Oct 1, 2024
  • Pharmacoepidemiology and drug safety
  • Echo Wang + 5 more

To assess the validity of privacy-preserving synthetic data by comparing results from synthetic versus original EHR data analysis. A published retrospective cohort study on real-world effectiveness of COVID-19 vaccines by Maccabi Healthcare Services in Israel was replicated using synthetic data generated from the same source, and the results were compared between synthetic versus original datasets. The endpoints included COVID-19 infection, symptomatic COVID-19 infection and hospitalization due to infection and were also assessed in several demographic and clinical subgroups. In comparing synthetic versus original data estimates, several metrics were used: standardized mean differences (SMD), decision agreement, estimate agreement, confidence interval overlap, and Wald test. Synthetic data were generated five times to assess the stability of results. The distribution of demographic and clinical characteristics demonstrated very small differences (< 0.01 SMD). In the comparison of vaccine effectiveness assessed in relative risk reduction between synthetic versus original data, there was a 100% decision agreement, 100% estimate agreement, and a high level of confidence interval overlap (88.7%-99.7%) in all five replicates across all subgroups. Similar findings were achieved in the assessment of vaccine effectiveness against symptomatic COVID-19 infection. In the comparison of hazard ratios for COVID-19-related hospitalization and odds ratios for symptomatic COVID-19 infection, the Wald tests suggested no significant difference between respective effect estimates in all five replicates for all patient subgroups, but there were disagreements in estimate and decision metrics in some subgroups and replicates. Overall, comparison of synthetic versus original real-world data demonstrated good validity and reliability. Transparency on the process to generate high-fidelity synthetic data and assurances of patient privacy are warranted.
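
The standardized mean difference (SMD) used in this abstract to compare baseline characteristics can be sketched as below for a continuous variable, assuming the common definition with the pooled standard deviation taken as the square root of the average of the two sample variances. The function name and the toy ages are hypothetical; the abstract does not give the exact formula used.

```python
import math
from statistics import mean, variance


def standardized_mean_difference(original, synthetic):
    """(mean difference) / sqrt(average of the two sample variances)."""
    pooled_sd = math.sqrt((variance(original) + variance(synthetic)) / 2)
    if pooled_sd == 0:
        raise ValueError("both samples are constant; SMD is undefined")
    return (mean(original) - mean(synthetic)) / pooled_sd


# Invented ages for illustration; not data from the study.
original_age = [1.0, 2.0, 3.0, 4.0, 5.0]
synthetic_age = [2.0, 3.0, 4.0, 5.0, 6.0]
print(round(standardized_mean_difference(original_age, synthetic_age), 4))
```

An absolute SMD below 0.1 is a widely used rule of thumb for negligible imbalance, so the reported values under 0.01 indicate very close agreement on that scale.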

  • Research Article
  • Cite count: 1
  • 10.1200/jco.2024.42.16_suppl.e13627
AI-generated synthetic clinical-genomic data for precision oncology research: Validation using a case study on lung adenocarcinoma.
  • Jun 1, 2024
  • Journal of Clinical Oncology
  • Brandon Theodorou + 6 more

e13627 Background: The analysis of genomic variants is crucial in precision oncology research, offering insights into cancer risks and progression, especially in diverse types such as lung adenocarcinoma (LUAD). However, such research often grapples with balancing patient privacy with the need for comprehensive, high-quality genomic datasets. Our project addresses this by creating synthetic clinical-genomic data, which maintains patient confidentiality and provides a rich resource for genomic cancer research. Methods: Leveraging the GuardantINFORM database, which includes anonymized genomic data and structured payer claims, we focused on generating synthetic data for LUAD patient cohorts. This approach involves processing real patient data into a format compatible with Medisyn’s generative AI models, ensuring the synthetic data retains the original's statistical properties, and processing the output back into the original database structure and format. This method plays a crucial role in maintaining patient privacy and serves as a valuable tool for research by enabling the generation of realistic patients with desired properties on demand. Results: Our synthetic data closely mirrors real-world genomic and claims variable distributions, evidenced by a 0.994 R2 correlation between real and synthetic data along with comparable Oncoprints. Importantly, privacy tests show that patient confidentiality is effectively maintained despite this effective performance. The synthetic data's utility was then demonstrated in a study replicating real-world findings: LUAD patients with KRAS G12C in combination with STK11 mutations showed a significantly higher risk of early mortality. This underscores the potential of synthetic data in advancing cancer research. Conclusions: This research offers a promising avenue for the cancer research community. 
By providing a method to share privatized, synthetic genomic data, which can be combined and generated on demand, we enable broader, more responsible data sharing. This approach protects patient privacy and offers a rich dataset for groundbreaking research, potentially accelerating advances in cancer diagnosis and treatment. [Table: see text]

  • Research Article
  • Cite count: 70
  • 10.1016/j.jbi.2021.103977
Membership inference attacks against synthetic health data
  • Dec 14, 2021
  • Journal of Biomedical Informatics
  • Ziqi Zhang + 2 more


  • Research Article
  • Cite count: 66
  • 10.1016/j.jbi.2023.104404
Generating synthetic personal health data using conditional generative adversarial networks combining with differential privacy
  • Jun 1, 2023
  • Journal of Biomedical Informatics
  • Chang Sun + 2 more

A large amount of personal health data that is highly valuable to the scientific community is still not accessible or requires a lengthy request process due to privacy concerns and legal restrictions. Synthetic data has been studied and proposed as a promising solution to this issue. However, generating realistic and privacy-preserving synthetic personal health data still presents challenges, such as simulating the characteristics of the patients’ data that are in the minority classes, capturing the relations among variables in imbalanced data and transferring them to the synthetic data, and preserving individual patients’ privacy. In this paper, we propose a differentially private conditional Generative Adversarial Network model (DP-CGANS) consisting of data transformation, sampling, conditioning, and network training to generate realistic and privacy-preserving personal data. Our model distinguishes categorical and continuous variables and transforms them into latent space separately for better training performance. We tackle the unique challenges of generating synthetic patient data due to the special data characteristics of personal health data. For example, patients with a certain disease are typically the minority in the dataset, and it is crucial to capture the relations among variables. Our model is structured with a conditional vector as an additional input to present the minority class in the imbalanced data and maximally capture the dependency between variables. Moreover, we inject statistical noise into the gradients in the network training process of DP-CGANS to provide a differential privacy guarantee. We extensively evaluate our model with state-of-the-art generative models on personal socio-economic datasets and real-world personal health datasets in terms of statistical similarity, machine learning performance, and privacy measurement.
We demonstrate that our model outperforms other comparable models, especially in capturing the dependence between variables. Finally, we present the balance between data utility and privacy in synthetic data generation considering the different data structures and characteristics of real-world personal health data such as imbalanced classes, abnormal distributions, and data sparsity.
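
The idea of injecting statistical noise into gradients during training can be illustrated with the standard clip-and-add-Gaussian-noise step used in differentially private SGD. This is a generic sketch under that assumption, not the DP-CGANS implementation; the function name, parameter names, and numbers are hypothetical, and no privacy accounting is shown.

```python
import math
import random


def dp_noisy_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each per-example gradient, sum, add Gaussian noise, then average."""
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(x * x for x in grad))
        # Scale down any gradient whose L2 norm exceeds clip_norm.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, x in enumerate(grad):
            total[i] += x * scale
    sigma = noise_multiplier * clip_norm  # noise scale tied to the clip bound
    n = len(per_example_grads)
    return [(t + rng.gauss(0.0, sigma)) / n for t in total]


# Two toy per-example gradients; the first is clipped from norm 5 to norm 1.
grads = [[3.0, 4.0], [0.3, 0.4]]
rng = random.Random(0)
print(dp_noisy_gradient(grads, clip_norm=1.0, noise_multiplier=1.1, rng=rng))
```

Clipping bounds how much any single patient record can move the model, and the noise scale is set relative to that bound; a larger `noise_multiplier` gives stronger privacy at the cost of noisier training.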

  • Research Article
  • 10.21203/rs.3.rs-8497559/v1
A novel pipeline for realistic synthetic longitudinal EHR data generation
  • Jan 29, 2026
  • Research Square
  • Gabrielle Josling + 2 more

Background: Synthetic health data offers a promising means of sharing clinical information without compromising patient privacy. However, existing methods often produce outputs that differ in structure from real data and are evaluated in narrow contexts, limiting their practical use in downstream analytical workflows. This study introduces a pipeline that builds upon existing methods for generating realistic synthetic longitudinal electronic health record data, evaluates it across three diverse datasets, and offers evidence-based guidance on the use of synthetic data to replace or augment real data. Methods: The pipeline extends existing state of the art HALO and ConSequence frameworks with a post-processing step that reconstructs continuous variables and timestamps, producing synthetic data that closely matches the structure of real medical record datasets. It was applied to three clinically diverse datasets: a small longitudinal cohort, a medium-sized intensive-care dataset, and a very large multi-hospital administrative dataset. Realism was assessed alongside utility for machine learning, statistical modelling, and time series analysis tasks. Results: Across all datasets, the pipeline generated realistic synthetic data that preserved key statistical properties and relationships. Machine learning models trained on synthetic data achieved similar predictive accuracy and feature importance patterns to those trained on real data, indicating strong utility. Synthetic data also performed well in statistical modelling, with the direction and magnitude of effects generally closely aligned with the real data. However, it may be less suitable when precise estimates are required or when modelling relatively rare conditions.
Importantly, although the pipeline reconstructed timestamp structures, it did not capture aggregate temporal patterns and the resulting data was therefore unsuitable for time series analysis. Conclusions: The pipeline produces realistic and analytically useful synthetic longitudinal electronic health record data across datasets of widely varying scales. These findings provide practical guidance on when synthetic data can meaningfully substitute for or complement real data.
