Exploring the Potential of Synthetic Data to Replace Real Data
The potential of synthetic data to replace real data creates a huge demand for synthetic data in data-hungry AI. This potential is even greater when synthetic data is used for training along with a small number of real images from domains other than the test domain. We find that this potential varies depending on (i) the number of cross-domain real images and (ii) the test set on which the trained model is evaluated. We introduce two new metrics, the train2test distance and AP_t2t, to evaluate how well a cross-domain training set that includes synthetic data represents the characteristics of test instances, in relation to training performance. Using these metrics, we delve deeper into the factors that influence the potential of synthetic data and uncover some interesting dynamics about how synthetic data impacts training performance. We hope these discoveries will encourage more widespread use of synthetic data.
- Preprint Article
- 10.2196/preprints.71364
- Jan 16, 2025
BACKGROUND High-quality, large-scale healthcare research, especially research using medical records, encounters significant challenges related to technical difficulties and confidentiality issues. As a result, critical research questions about patient evaluation and treatment have been left unanswered. Moreover, the stigma and heightened sensitivity surrounding mental health issues have significantly delayed research progress, particularly concerning Child and Adolescent Mental Health Services (CAMHS). OBJECTIVE These challenges can be effectively addressed by generating synthetic data, which not only safeguards individual privacy but also facilitates comprehensive analyses of clinical information from EMRs and other clinical data sources. To exemplify this method, we have used CAMHS synthetic data for planning the allocation of mental health resources while ensuring confidentiality. In the process, using mental health clinical data, we demonstrate how to create and successfully analyse synthetic data from large-scale EMR-based data to answer critical healthcare questions for policymakers and clinicians. METHODS The study was carried out on a retrospectively collected cohort comprising 6,924 distinct patients from the Child and Adolescent Mental Health Services (CAMHS) in Stavanger, Norway. The analysis included 7,730 referral periods and a total of 58,524 episodes of care. The full dataset was divided into a training cohort (n = 6,184 referrals, 58,524 episodes of care) and an independent, fixed test set (n = 1,564 referrals, 14,610 episodes of care). A hierarchical synthetic data generation model was used to generate synthetic referral periods with the associated episodes of care based on “real-world” CAMHS data. In addition to the utility of the data, the quality and privacy risk of the generated synthetic data were assessed.
RESULTS The synthetic hierarchical data generation model created reproducible synthetic CAMHS data with properties very similar to “real-world” data (KS/TVD Complement score = 0.92, CS score = 0.77, CS (inter-table) score = 0.75 and CSS score = 0.92), while demonstrating low risk scores when exposed to a set of privacy attacks (average Singleout score (univariate) = 0.17, average Singleout score (multivariate) = 0.04, average Linkability risk = 2.5, average Inference risk = 0.7). The predictive model trained on synthetic data produced performance comparable to the model trained on real data when classifying the intensity of care required by patients, while maintaining the interpretability of the utilized features (for n = 656, 1,546, 3,092 and 6,184, average PR_AUC = 0.32, 0.33, 0.34 and 0.40 respectively, compared to PR_AUC = 0.43 when using n = 6,184 real data records). CONCLUSIONS Synthetic data in Child and Adolescent Mental Health Services (CAMHS) balances data utility with fairness and privacy protection. It fosters trust between patients and healthcare providers while promoting collaboration among researchers by offering access to extensive and representative samples with a low risk of patient identification. This approach not only encourages data sharing but also expands the breadth of research while safeguarding patient privacy. Effective implementation of synthetic data generation methods in CAMHS depends on the model's ability to accurately identify and replicate the complex patterns present in real data, while maintaining consistency across various outputs.
Therefore, selecting the appropriate technique is crucial for achieving accurate and insightful research findings in this field.
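The KS/TVD Complement fidelity scores reported above are typically computed per column as one minus a distributional distance. A minimal sketch for a numeric column, assuming the common definition (1 minus the two-sample Kolmogorov-Smirnov statistic) rather than this study's exact implementation; the column semantics are hypothetical:

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_complement(real: np.ndarray, synth: np.ndarray) -> float:
    """1 - KS statistic: 1.0 means identical empirical distributions."""
    stat, _ = ks_2samp(real, synth)
    return 1.0 - stat

rng = np.random.default_rng(0)
real = rng.normal(10.0, 2.0, 1000)       # e.g. a numeric clinical variable
synth = rng.normal(10.1, 2.1, 1000)      # its synthetic counterpart
score = ks_complement(real, synth)       # close to 1.0 for similar columns
```

A dataset-level score is then usually the average of this quantity over all numeric columns (with a total-variation-distance complement playing the same role for categorical columns).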
- Research Article
- 10.69554/lqom5698
- Jun 1, 2024
- Journal of Data Protection & Privacy
This paper explores the potential applications of high-fidelity synthetic patient data in the context of healthcare research, including challenges and benefits. The paper starts by defining synthetic data, types of synthetic data and approaches to generating synthetic data. It then discusses the potential applications of synthetic data beyond its role as a privacy-enhancing technology, and current debates around whether synthetic data should be considered personal data and, therefore, subjected to privacy controls to minimise reidentification risks. This is followed by a discussion of privacy preservation approaches and privacy metrics that can be applied in the context of synthetic data. The paper includes a case study, based on synthetic electronic healthcare record data from the Clinical Practice Research Datalink, on how privacy concerns due to reidentification have been addressed in order to make this data available for research purposes. The authors conclude that synthetic data, particularly high-fidelity synthetic patient data, has the potential to add value over and above real data for public health, and that it is possible to address privacy concerns and make synthetic data available via a combination of privacy measures applied during the synthetic data generation process and post-generation reidentification risk assessments as part of data protection impact assessments.
- Research Article
- 10.1002/pds.70019
- Oct 1, 2024
- Pharmacoepidemiology and drug safety
To assess the validity of privacy-preserving synthetic data, we compared results from synthetic versus original EHR data analysis. A published retrospective cohort study on the real-world effectiveness of COVID-19 vaccines by Maccabi Healthcare Services in Israel was replicated using synthetic data generated from the same source, and the results were compared between the synthetic and original datasets. The endpoints included COVID-19 infection, symptomatic COVID-19 infection, and hospitalization due to infection, and were also assessed in several demographic and clinical subgroups. In comparing synthetic versus original data estimates, several metrics were utilized: standardized mean differences (SMD), decision agreement, estimate agreement, confidence interval overlap, and the Wald test. Synthetic data were generated five times to assess the stability of results. The distribution of demographic and clinical characteristics demonstrated very small differences (SMD < 0.01). In the comparison of vaccine effectiveness, assessed as relative risk reduction, between synthetic and original data, there was 100% decision agreement, 100% estimate agreement, and a high level of confidence interval overlap (88.7%-99.7%) in all five replicates across all subgroups. Similar findings were achieved in the assessment of vaccine effectiveness against symptomatic COVID-19 infection. In the comparison of hazard ratios for COVID-19-related hospitalization and odds ratios for symptomatic COVID-19 infection, the Wald tests suggested no significant difference between the respective effect estimates in all five replicates for all patient subgroups, but there were disagreements in the estimate and decision metrics in some subgroups and replicates. Overall, the comparison of synthetic versus original real-world data demonstrated good validity and reliability. Transparency on the process to generate high-fidelity synthetic data and assurances of patient privacy are warranted.
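Two of the comparison metrics above can be sketched in a few lines. The formulas here are common choices (a pooled-SD standardized mean difference, and overlap expressed as a fraction of the narrower interval) and may differ in detail from the study's implementation:

```python
import math

def smd(mean_r: float, sd_r: float, mean_s: float, sd_s: float) -> float:
    """Standardized mean difference between a real and a synthetic column."""
    pooled_sd = math.sqrt((sd_r**2 + sd_s**2) / 2.0)
    return abs(mean_r - mean_s) / pooled_sd

def ci_overlap(lo1: float, hi1: float, lo2: float, hi2: float) -> float:
    """Overlap of two confidence intervals as a fraction of the narrower one."""
    intersection = max(0.0, min(hi1, hi2) - max(lo1, lo2))
    narrower = min(hi1 - lo1, hi2 - lo2)
    return intersection / narrower if narrower > 0 else 0.0

# Hypothetical age column: nearly identical distributions give SMD << 0.01
age_smd = smd(52.4, 16.1, 52.45, 16.0)
```

An SMD below 0.01 (as reported above) indicates that real and synthetic column means differ by less than one percent of a standard deviation.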
- Research Article
- 10.3233/shti240490
- Aug 22, 2024
- Studies in health technology and informatics
The sharing of personal health data is highly regulated due to privacy and security concerns. An alternative to sharing personal data is to share synthetic data: ideally, it should be impossible to reconstruct real personal data from synthetic data, a property referred to as privacy. At the same time, the structure of the synthetic data should be as similar as possible to that of the real data, so that conclusions drawn from the synthetic data are also valid for the real data, a property referred to as fidelity. Typically, there is a tradeoff between fidelity and privacy for synthetic health data. We study the fidelity and privacy of cancer data synthesized using generative machine learning approaches. To generate synthetic cancer data, we use variational autoencoders (VAEs), generative adversarial networks (GANs), and denoising diffusion probabilistic models (DDPMs). The tabular cancer registry data studied comprise nine categorical variables from breast cancer patients. We find that DDPMs generate synthetic cancer data with higher fidelity; that is, the structure of the synthetic data is more similar to the real cancer data than that of the data generated by VAEs and GANs. At the same time, synthetic cancer data from DDPMs pose a greater privacy risk, because the data are more likely to reveal information from real patients than synthetic data from VAEs and GANs.
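A common way to quantify the kind of privacy risk discussed above (though not necessarily the metric used in this study) is the distance to closest record (DCR): synthetic rows that sit unusually close to real rows suggest the generator memorized patients. A minimal numpy sketch on numerically encoded tabular data:

```python
import numpy as np

def dcr(real: np.ndarray, synth: np.ndarray) -> np.ndarray:
    """For each synthetic row, Euclidean distance to its nearest real row.
    Many near-zero distances indicate likely memorization of real records."""
    # Pairwise distance matrix via broadcasting: shape (n_synth, n_real)
    d = np.linalg.norm(synth[:, None, :] - real[None, :, :], axis=2)
    return d.min(axis=1)

rng = np.random.default_rng(1)
real = rng.normal(size=(200, 9))    # nine encoded variables, as in the study
synth = rng.normal(size=(100, 9))   # candidate synthetic records
dists = dcr(real, synth)
```

In practice the DCR distribution of synthetic-to-real distances is compared against a real-to-real holdout baseline: if synthetic rows are systematically closer to training rows than held-out real rows are, privacy risk is elevated.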
- Abstract
- 10.1182/blood-2022-168646
- Nov 15, 2022
- Blood
Synthetic Data Generation By Artificial Intelligence to Accelerate Translational Research and Precision Medicine in Hematological Malignancies
- Research Article
- 10.3171/2025.4.focus25225
- Jul 1, 2025
- Neurosurgical focus
Use of neurosurgical data for clinical research and machine learning (ML) model development is often limited by data availability, sample sizes, and regulatory constraints. Synthetic data offer a potential solution to challenges associated with accessing, sharing, and using real-world data (RWD). The aim of this study was to evaluate the capability of generating synthetic neurosurgical data with a generative adversarial network and large language model (LLM) to augment RWD, perform secondary analyses in place of RWD, and train an ML model to predict postoperative outcomes. Synthetic data were generated with a conditional tabular generative adversarial network (CTGAN) and the LLM GPT-4o based on a real-world neurosurgical dataset of 140 older adults who underwent neurosurgical interventions. Each model was used to generate datasets at equivalent (n = 140) and amplified (n = 1000) sample sizes. Data fidelity was evaluated by comparing univariate and bivariate statistics to the RWD. Privacy evaluation involved measuring the uniqueness of generated synthetic records. Utility was assessed by: 1) reproducing and extending clinical analyses on predictors of Karnofsky Performance Status (KPS) deterioration at discharge and a prolonged postoperative intensive care unit (ICU) stay, and 2) training a binary ML classifier on amplified synthetic datasets to predict KPS deterioration on RWD. Both the CTGAN and GPT-4o generated complete, high-fidelity synthetic tabular datasets. GPT-4o matched or exceeded CTGAN across all measured fidelity, utility, and privacy metrics. All significant clinical predictors of KPS deterioration and prolonged ICU stay were retained in the GPT-4o-generated synthetic data, with some differences observed in effect sizes. Preoperative KPS was not preserved as a significant predictor in the CTGAN-generated data. 
The ML classifier trained on GPT-4o data outperformed the model trained on CTGAN data, achieving a higher F1 score (0.725 vs 0.688) for predicting KPS deterioration. This study demonstrated a promising ability to produce high-fidelity synthetic neurosurgical data using generative models. Synthetic neurosurgical data present a potential solution to critical limitations in data availability for neurosurgical research. Further investigation is necessary to enhance synthetic data utility for secondary analyses and ML model training, and to evaluate synthetic data generation methods across other datasets, including clinical trial data.
- Research Article
- 10.12688/f1000research.155230.2
- Jan 2, 2025
- F1000Research
Synthetic data's utility in benchmark studies depends on its ability to closely mimic real-world conditions and reproduce results obtained from experimental data. Building on the study by Nearing et al. (1), who assessed 14 differential abundance tests using 38 experimental 16S rRNA datasets in a case-control design, we generate synthetic datasets that mimic the experimental data to verify their findings. We will employ statistical tests to rigorously assess the similarity between synthetic and experimental data and to validate the conclusions on the performance of these tests drawn by Nearing et al. (1). This protocol adheres to the SPIRIT guidelines, demonstrating how established reporting frameworks can support robust, transparent, and unbiased study planning. We replicate Nearing et al.'s (1) methodology, incorporating synthetic data simulated using two distinct tools, mirroring the 38 experimental datasets. Equivalence tests will be conducted on a non-redundant subset of 46 data characteristics comparing synthetic and experimental data, complemented by principal component analysis for overall similarity assessment. The 14 differential abundance tests will be applied to synthetic and experimental datasets, evaluating the consistency of significant feature identification and the number of significant features per tool. Correlation analysis and multiple regression will explore how differences between synthetic and experimental data characteristics may affect the results. Synthetic data enable the validation of findings through controlled experiments. We assess how well synthetic data replicate experimental data, attempt to validate previous findings with the most recent versions of the DA methods, and delineate the strengths and limitations of synthetic data in benchmark studies.
Moreover, to our knowledge this is the first computational benchmark study to systematically incorporate synthetic data for validating differential abundance methods while strictly adhering to a pre-specified study protocol following SPIRIT guidelines, contributing to transparency, reproducibility, and unbiased research.
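Equivalence tests such as those planned in this protocol are often implemented as two one-sided t-tests (TOST). The sketch below uses a simplified pooled degrees-of-freedom approximation, and the equivalence bound and data are illustrative; the protocol's exact procedure may differ:

```python
import numpy as np
from scipy import stats

def tost_p(x, y, bound: float) -> float:
    """Two one-sided t-tests for equivalence: H0 is |mean(x) - mean(y)| >= bound.
    Returns the larger one-sided p-value; a small value supports equivalence."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    diff = x.mean() - y.mean()
    se = np.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))
    df = len(x) + len(y) - 2
    p_lower = 1.0 - stats.t.cdf((diff + bound) / se, df)  # is diff > -bound?
    p_upper = stats.t.cdf((diff - bound) / se, df)        # is diff < +bound?
    return max(p_lower, p_upper)

rng = np.random.default_rng(3)
real_char = rng.normal(0.0, 1.0, 200)     # e.g. one data characteristic
synth_char = rng.normal(0.05, 1.0, 200)   # its synthetic counterpart
p = tost_p(real_char, synth_char, bound=0.5)
```

Unlike a standard t-test, a small p-value here is evidence that the synthetic and experimental characteristics are equivalent within the chosen bound, rather than merely not detectably different.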
- Research Article
- 10.1190/tle41060392.1
- Jun 1, 2022
- The Leading Edge
This paper discusses the generation of synthetic 3D seismic data for training neural networks to solve a variety of seismic processing, interpretation, and inversion tasks. Using synthetic data is a way to address the shortage of seismic data, which are required for solving problems with machine learning techniques. Synthetic data are built via a simulation process that is based on a mathematical representation of the physics of the problem. In other words, using synthetic data is an indirect way to teach neural networks about the physics of the problem. An important incentive for using synthetic data to solve problems with artificial intelligence methods is that with real seismic data the ground truth is always unknown. When generating synthetic seismic data, we first build the model and then calculate the data, so the answer (model) is always known and always exact. We describe a methodology for generating on-the-fly simulated postmigration (1D modeling) synthetic data in 3D, which are high resolution and look similar to real data. A wide range of models is covered by generating an unlimited number of data examples. The synthetic data are built from impedance models that are constructed through geostatistical simulation of real well logs. With geostatistical simulation, we can describe various geologic variance models in 3D and obtain realistic images. To cover a broad range of scenarios, we need to generalize the seismic data story by randomly perturbing many parameters including structures, conformity styles, dip-strike directions, variograms, measured input logs, frequencies, phase spectra, etc.
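The 1D (convolutional) modeling step described here can be illustrated as: convert an impedance series to reflectivity, then convolve with a wavelet. The Ricker wavelet choice and all parameters below are illustrative, not the authors' exact configuration:

```python
import numpy as np

def ricker(f: float, dt: float, length: float = 0.128) -> np.ndarray:
    """Ricker (Mexican hat) wavelet with peak frequency f (Hz)."""
    t = np.arange(-length / 2, length / 2, dt)
    a = (np.pi * f * t) ** 2
    return (1.0 - 2.0 * a) * np.exp(-a)

def synthetic_trace(impedance, f: float = 30.0, dt: float = 0.002) -> np.ndarray:
    """Reflectivity from an acoustic impedance series, convolved with a wavelet."""
    z = np.asarray(impedance, dtype=float)
    refl = (z[1:] - z[:-1]) / (z[1:] + z[:-1])   # normal-incidence reflectivity
    return np.convolve(refl, ricker(f, dt), mode="same")

# Two-layer model: a single impedance contrast yields one wavelet on the trace
imp = np.concatenate([np.full(100, 5000.0), np.full(100, 7000.0)])
trace = synthetic_trace(imp)
```

In the workflow described above, the impedance input would come from geostatistical simulation of real well logs rather than a hand-built layer model, and the wavelet frequency and phase would be among the randomly perturbed parameters.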
- Research Article
- 10.3390/s24092750
- Apr 25, 2024
- Sensors
Biometric authentication plays a vital role in many everyday applications, with increasing demands for reliability and security. However, the use of real biometric data for research raises privacy concerns and data-scarcity issues. Synthetic biometric data have emerged as a promising approach to address unbalanced representation and bias, as well as the limited availability of diverse datasets for the development and evaluation of biometric systems. Methods for the parameterized generation of highly realistic synthetic data are emerging, and the quality metrics needed to prove that synthetic data compare to real data remain open research tasks. We explore the generation of 3D synthetic face data using game engines' ability to generate varied, realistic virtual characters as an alternative for producing synthetic face data while maintaining reproducibility and ground truth, as opposed to other creation methods. While synthetic data offer several benefits, including improved resilience against data privacy concerns, the limitations and challenges associated with their usage are addressed. Our results show consistent behavior when comparing semi-synthetic data, as digital representations of real identities, with the corresponding real datasets. Despite slightly asymmetrical performance relative to a larger database of real samples, promising performance in face data authentication is shown, which lays the foundation for further investigations with digital avatars and the creation and analysis of fully synthetic data. Future directions for improving synthetic biometric data generation and their impact on advancing biometrics research are discussed.
- Research Article
- 10.12688/f1000research.155230.1
- Oct 9, 2024
- F1000Research
Background The utility of synthetic data in benchmark studies depends on its ability to closely mimic real-world conditions and to reproduce results obtained from experimental data. Here, we evaluate the performance of differential abundance tests for 16S metagenomic data. Building on the benchmark study by Nearing et al. (1), who assessed 14 differential abundance tests using 38 experimental datasets in a case-control design, we validate their findings by generating synthetic datasets that mimic the experimental data. We will employ statistical tests to rigorously assess the similarity between synthetic and experimental data and to validate the conclusions on the performance of these tests drawn by Nearing et al. (1). This protocol adheres to the SPIRIT guidelines and is, to our knowledge, the first of its kind in computational benchmark studies. Methods We replicate Nearing et al.'s (1) methodology, incorporating synthetic data simulated using two distinct tools, mirroring each of the 38 experimental datasets. Equivalence tests will be conducted on 43 data characteristics comparing synthetic and experimental data, complemented by principal component analysis for overall similarity assessment. The 14 differential abundance tests will be applied to both synthetic and experimental datasets, evaluating the consistency of significant feature identification and the number of significant features per tool. Correlation analysis and multiple regression will explore how differences between synthetic and experimental data characteristics may affect the results. Conclusions Synthetic data enable the validation of findings through controlled experiments. We assess how well synthetic data replicate experimental data, validate previous findings, and delineate the strengths and limitations of synthetic data in benchmark studies.
Moreover, to our knowledge this is the first computational benchmark study to systematically incorporate synthetic data for validating differential abundance methods while strictly adhering to a pre-specified study protocol following SPIRIT guidelines, contributing significantly to transparency, reproducibility, and unbiased research.
- Abstract
- 10.1182/blood-2022-171057
- Nov 15, 2022
- Blood
Systematic Evaluation of Synthetic Panel Data Quality with an Application to Chronic Lymphocytic Leukemia
- Research Article
- 10.52756/ijerr.2023.v30.004
- Apr 30, 2023
- International Journal of Experimental Research and Review
The Generative Adversarial Network (GAN) is a revolution in modern artificial systems. Deep-learning-based generative adversarial networks generate realistic synthetic tabular data. Synthetic data are used to enlarge a relatively small training dataset while ensuring the confidentiality of the original data. In this context, we implemented the GAN framework for generating diabetes data to help healthcare professionals in more clinical applications. The framework is validated on the Pima Indians Diabetes (PID) dataset. Various preprocessing techniques, such as handling missing values, outliers and data-imbalance problems, enhance data quality. Exploratory data analyses, such as heat maps, bar graphs and histograms, are used for data visualisation. We employed hypothesis testing to examine the resemblance between real data and GAN-generated synthetic data. In this study, we propose a GAN-Long Short-Term Memory (GLSTM) system, in which the GAN is used for data augmentation and the LSTM for diabetes classification. Additionally, various generative models, such as CTGAN, vanilla GAN, Copula GAN, Gaussian Copula GAN and TVAE, are used to generate the synthetic dataset. Experiments were conducted on real data, on synthetic data, and on combined real and synthetic data. The model that used both real and synthetic data obtained a substantially better accuracy of 97%, compared to 92% when only real data were used. We also observed that synthetic data could be used in place of real data, as the mean correlation between synthetic and real data is 0.93. Our findings outperform state-of-the-art methodologies.
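The reported mean correlation of 0.93 between synthetic and real data can be read as a statement about preserved correlation structure. One plausible way to compute such a score (a sketch, not necessarily the authors' definition) is to correlate the pairwise feature-correlation matrices of the two datasets:

```python
import numpy as np

def corr_similarity(real: np.ndarray, synth: np.ndarray) -> float:
    """Correlate the off-diagonal entries of the feature-correlation matrices
    of two datasets: 1.0 means identical pairwise correlation structure."""
    cr = np.corrcoef(real, rowvar=False)
    cs = np.corrcoef(synth, rowvar=False)
    iu = np.triu_indices_from(cr, k=1)   # upper-triangle (off-diagonal) entries
    return float(np.corrcoef(cr[iu], cs[iu])[0, 1])

rng = np.random.default_rng(2)
real = rng.normal(size=(500, 8))                        # eight PID-style features
synth = real + rng.normal(scale=0.3, size=real.shape)   # stand-in synthetic data
sim = corr_similarity(real, synth)
```

A score near 1.0 indicates that the synthetic generator reproduced the dependencies between features, which is what matters for downstream classifiers trained on the synthetic table.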
- Research Article
- 10.21203/rs.3.rs-8497559/v1
- Jan 29, 2026
- Research Square
Background Synthetic health data offers a promising means of sharing clinical information without compromising patient privacy. However, existing methods often produce outputs that differ in structure from real data and are evaluated in narrow contexts, limiting their practical use in downstream analytical workflows. This study introduces a pipeline that builds upon existing methods for generating realistic synthetic longitudinal electronic health record data, evaluates it across three diverse datasets, and offers evidence-based guidance on the use of synthetic data to replace or augment real data. Methods The pipeline extends the existing state-of-the-art HALO and ConSequence frameworks with a post-processing step that reconstructs continuous variables and timestamps, producing synthetic data that closely matches the structure of real medical record datasets. It was applied to three clinically diverse datasets: a small longitudinal cohort, a medium-sized intensive-care dataset, and a very large multi-hospital administrative dataset. Realism was assessed alongside utility for machine learning, statistical modelling, and time series analysis tasks. Results Across all datasets, the pipeline generated realistic synthetic data that preserved key statistical properties and relationships. Machine learning models trained on synthetic data achieved similar predictive accuracy and feature importance patterns to those trained on real data, indicating strong utility. Synthetic data also performed well in statistical modelling, with the direction and magnitude of effects generally closely aligned with the real data. However, it may be less suitable when precise estimates are required or when modelling relatively rare conditions.
Importantly, although the pipeline reconstructed timestamp structures, it did not capture aggregate temporal patterns, and the resulting data were therefore unsuitable for time series analysis. Conclusions The pipeline produces realistic and analytically useful synthetic longitudinal electronic health record data across datasets of widely varying scales. These findings provide practical guidance on when synthetic data can meaningfully substitute for or complement real data.
- Research Article
- 10.2196/16492
- Feb 20, 2020
- JMIR Medical Informatics
Background Privacy restrictions limit access to protected patient-derived health information for research purposes. Consequently, data anonymization is required to allow researchers data access for initial analysis before granting institutional review board approval. A system installed and activated at our institution enables synthetic data generation that mimics data from real electronic medical records, wherein only fictitious patients are listed. Objective This paper aimed to validate the results obtained when analyzing synthetic structured data for medical research. A comprehensive validation process concerning meaningful clinical questions and various types of data was conducted to assess the accuracy and precision of statistical estimates derived from synthetic patient data. Methods A cross-hospital project was conducted to validate results obtained from synthetic data produced for five contemporary studies on various topics. For each study, results derived from synthetic data were compared with those based on real data. In addition, repeatedly generated synthetic datasets were used to estimate the bias and stability of results obtained from synthetic data. Results This study demonstrated that results derived from synthetic data were predictive of results from real data. When the number of patients was large relative to the number of variables used, highly accurate and strongly consistent results were observed between synthetic and real data. For studies based on smaller populations that accounted for confounders and modifiers by multivariate models, predictions were of moderate accuracy, yet clear trends were correctly observed. Conclusions The use of synthetic structured data provides a close estimate to real data results and is thus a powerful tool in shaping research hypotheses and accessing estimated analyses, without risking patient privacy.
Synthetic data enable broad access to data (eg, for out-of-organization researchers), and rapid, safe, and repeatable analysis of data in hospitals or other health organizations where patient privacy is a primary value.
- Abstract
- 10.1182/blood-2024-209541
- Nov 5, 2024
- Blood
Generation of Multimodal Longitudinal Synthetic Data By Artificial Intelligence to Improve Personalized Medicine in Hematology