Computational Study Protocol: Leveraging Synthetic Data to Validate a Benchmark Study for Differential Abundance Tests for 16S Microbiome Sequencing Data.

Abstract

Synthetic data's utility in benchmark studies depends on its ability to closely mimic real-world conditions and reproduce results obtained from experimental data. Building on the study by Nearing et al. (1), which assessed 14 differential abundance tests using 38 experimental 16S rRNA datasets in a case-control design, we are generating synthetic datasets that mimic the experimental data to verify their findings. We will employ statistical tests to rigorously assess the similarity between synthetic and experimental data and to validate the conclusions on the performance of these tests drawn by Nearing et al. (1). This protocol adheres to the SPIRIT guidelines, demonstrating how established reporting frameworks can support robust, transparent, and unbiased study planning. We replicate Nearing et al.'s (1) methodology, incorporating synthetic data simulated using two distinct tools, mirroring the 38 experimental datasets. Equivalence tests will be conducted on a non-redundant subset of 46 data characteristics comparing synthetic and experimental data, complemented by principal component analysis for overall similarity assessment. The 14 differential abundance tests will be applied to synthetic and experimental datasets, evaluating the consistency of significant feature identification and the number of significant features per tool. Correlation analysis and multiple regression will explore how differences between synthetic and experimental data characteristics may affect the results. Synthetic data enables the validation of findings through controlled experiments. We assess how well synthetic data replicates experimental data, attempt to validate previous findings with the most recent versions of the differential abundance (DA) methods, and delineate the strengths and limitations of synthetic data in benchmark studies.
Moreover, to our knowledge this is the first computational benchmark study to systematically incorporate synthetic data for validating differential abundance methods while strictly adhering to a pre-specified study protocol following SPIRIT guidelines, contributing to transparency, reproducibility, and unbiased research.
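
The equivalence testing mentioned in the abstract can be illustrated with a minimal sketch of the two one-sided tests (TOST) procedure. This is not the authors' implementation: the function name `tost_equivalence`, the equivalence margin, the normal approximation, and the example values are all assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def tost_equivalence(experimental, synthetic, margin, alpha=0.05):
    """Two one-sided tests (TOST) using a normal approximation.

    H0: |mean(experimental) - mean(synthetic)| >= margin
    H1: the means differ by less than `margin` (equivalence).
    `margin` is a hypothetical, user-chosen equivalence bound.
    """
    diff = mean(experimental) - mean(synthetic)
    se = sqrt(variance(experimental) / len(experimental)
              + variance(synthetic) / len(synthetic))
    if se == 0:
        raise ValueError("standard error is zero; the test is undefined")
    std_normal = NormalDist()
    # One-sided test against the lower bound (H0: diff <= -margin).
    p_lower = 1.0 - std_normal.cdf((diff + margin) / se)
    # One-sided test against the upper bound (H0: diff >= +margin).
    p_upper = std_normal.cdf((diff - margin) / se)
    # Equivalence is claimed only if both one-sided tests reject.
    p_value = max(p_lower, p_upper)
    return p_value, p_value < alpha


# Hypothetical values of one data characteristic (e.g. sparsity)
# for an experimental template and its synthetic counterparts.
experimental = [0.50, 0.52, 0.48, 0.51, 0.49]
synthetic = [0.50, 0.51, 0.49, 0.52, 0.48]
p_value, equivalent = tost_equivalence(experimental, synthetic, margin=0.1)
print(f"p = {p_value:.4g}, equivalent = {equivalent}")
```

With small samples, a t-distribution-based TOST would usually be preferred over the normal approximation used here.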

Similar Papers
  • Research Article
  • Cite Count: 1
  • 10.12688/f1000research.155230.1
Computational Study Protocol: Leveraging Synthetic Data to Validate a Benchmark Study for Differential Abundance Tests for 16S Microbiome Sequencing Data
  • Oct 9, 2024
  • F1000Research
  • Eva Kohnert + 1 more

Background: The utility of synthetic data in benchmark studies depends on its ability to closely mimic real-world conditions and to reproduce results obtained from experimental data. Here, we evaluate the performance of differential abundance tests for 16S metagenomic data. Building on the benchmark study by Nearing et al. (1), who assessed 14 differential abundance tests using 38 experimental datasets in a case-control design, we validate their findings by generating synthetic datasets that mimic the experimental data. We will employ statistical tests to rigorously assess the similarity between synthetic and experimental data and to validate the conclusions on the performance of these tests drawn by Nearing et al. (1). This protocol adheres to the SPIRIT guidelines and is, to our knowledge, the first of its kind in computational benchmark studies.
Methods: We replicate Nearing et al.’s (1) methodology, incorporating synthetic data simulated using two distinct tools, mirroring each of the 38 experimental datasets. Equivalence tests will be conducted on 43 data characteristics comparing synthetic and experimental data, complemented by principal component analysis for overall similarity assessment. The 14 differential abundance tests will be applied to both synthetic and experimental datasets, evaluating the consistency of significant feature identification and the number of significant features per tool. Correlation analysis and multiple regression will explore how differences between synthetic and experimental data characteristics may affect the results.
Conclusions: Synthetic data enables the validation of findings through controlled experiments. We assess how well synthetic data replicates experimental data, validate previous findings and delineate the strengths and limitations of synthetic data in benchmark studies. Moreover, to our knowledge this is the first computational benchmark study to systematically incorporate synthetic data for validating differential abundance methods while strictly adhering to a pre-specified study protocol following SPIRIT guidelines, contributing significantly to transparency, reproducibility, and unbiased research.

  • Research Article
  • 10.12688/f1000research.163152.1
Leveraging Synthetic Data to Validate a Benchmark Study for Differential Abundance Tests for 16S Microbiome Sequencing Data
  • Jun 25, 2025
  • F1000Research
  • Eva Kohnert + 1 more

Background: Synthetic data’s utility in benchmark studies depends on its ability to closely mimic real-world conditions and reproduce results from experimental data. Building on Nearing et al.’s study (1), which assessed 14 differential abundance tests using 38 experimental 16S rRNA datasets in a case-control design, we generated synthetic datasets to verify these findings. We rigorously assessed the similarity between synthetic and experimental data and validated the conclusions on the performance of these tests drawn by Nearing et al. (1). This study adheres to the study protocol: Computational Study Protocol: Leveraging Synthetic Data to Validate a Benchmark Study for Differential Abundance Tests for 16S Microbiome Sequencing Data (2).
Methods: We replicated Nearing et al.’s (1) methodology, incorporating simulated data using two distinct tools (metaSPARSim (3) and sparseDOSSA2 (4)), mirroring the 38 experimental datasets. Equivalence tests were conducted on a set of 30 data characteristics (DC) comparing synthetic and experimental data, complemented by principal component analysis for overall similarity assessment. The 14 differential abundance tests were applied to synthetic datasets, evaluating the consistency of significant feature identification and the proportion of significant features per tool. Correlation analysis, multiple regression and decision trees were used to explore how differences between synthetic and experimental DCs may affect the results.
Conclusions: Adhering to a formal study protocol in computational benchmarking studies is crucial for ensuring transparency and minimizing bias, though it comes with challenges, including significant effort required for planning, execution, and documentation. In this study, metaSPARSim (3) and sparseDOSSA2 (4) successfully generated synthetic data mirroring the experimental templates, validating trends in differential abundance tests. Of 27 hypotheses tested, 6 were fully validated, with similar trends for 37%. While hypothesis testing remains challenging, especially when translating qualitative observations from text into testable hypotheses, synthetic data for validation and benchmarking shows great promise for future research.

  • Abstract
  • Cite Count: 2
  • 10.1182/blood-2022-168646
Synthetic Data Generation By Artificial Intelligence to Accelerate Translational Research and Precision Medicine in Hematological Malignancies
  • Nov 15, 2022
  • Blood
  • Saverio D'Amico + 19 more

  • Research Article
  • Cite Count: 2
  • 10.2196/53241
Creating High-Quality Synthetic Health Data: Framework for Model Development and Validation.
  • Apr 22, 2024
  • JMIR Formative Research
  • Elnaz Karimian Sichani + 3 more

Electronic health records are a valuable source of patient information that must be properly deidentified before being shared with researchers. This process requires expertise and time. In addition, synthetic data have considerably reduced the restrictions on the use and sharing of real data, allowing researchers to access it more rapidly with far fewer privacy constraints. Therefore, there has been a growing interest in establishing a method to generate synthetic data that protects patients' privacy while properly reflecting the data. This study aims to develop and validate a model that generates valuable synthetic longitudinal health data while protecting the privacy of the patients whose data are collected. We investigated the best model for generating synthetic health data, with a focus on longitudinal observations. We developed a generative model that relies on the generalized canonical polyadic (GCP) tensor decomposition. This model also involves sampling from a latent factor matrix of GCP decomposition, which contains patient factors, using sequential decision trees, copula, and Hamiltonian Monte Carlo methods. We applied the proposed model to samples from the MIMIC-III (version 1.4) data set. Numerous analyses and experiments were conducted with different data structures and scenarios. We assessed the similarity between our synthetic data and the real data by conducting utility assessments. These assessments evaluate the structure and general patterns present in the data, such as dependency structure, descriptive statistics, and marginal distributions. Regarding privacy disclosure, our model preserves privacy by preventing the direct sharing of patient information and eliminating the one-to-one link between the observed and model tensor records. This was achieved by simulating and modeling a latent factor matrix of GCP decomposition associated with patients. 
The findings show that our model is a promising method for generating synthetic longitudinal health data that is similar enough to real data. It can preserve the utility and privacy of the original data while also handling various data structures and scenarios. In certain experiments, all simulation methods used in the model produced the same high level of performance. Our model is also capable of addressing the challenge of sampling patients from electronic health records. This means that we can simulate a variety of patients in the synthetic data set, which may differ in number from the patients in the original data. We have presented a generative model for producing synthetic longitudinal health data. The model is formulated by applying the GCP tensor decomposition. We have provided 3 approaches for the synthesis and simulation of a latent factor matrix following the process of factorization. In brief, we have reduced the challenge of synthesizing massive longitudinal health data to synthesizing a nonlongitudinal and significantly smaller data set.

  • Conference Article
  • 10.54941/ahfe1006801
Data Synthetization and Feature Analysis: A Study in Bladder Cancer Recurrence Data
  • Jan 1, 2025
  • AHFE international
  • Sandi Baressi Šegota + 7 more

The application of synthetic data within the biomedical domain is rapidly gaining momentum, driven by the growing need for robust datasets suitable for machine learning (ML) and statistical modeling. In scenarios where access to real patient data is limited due to privacy concerns or scarcity, synthetic data offers an attractive alternative. These artificially generated datasets aim to mimic the statistical characteristics of original data, enabling researchers to conduct exploratory analysis, develop predictive models, or validate findings without compromising patient confidentiality. However, the increasing use of synthetic data raises several methodological and interpretative challenges, particularly regarding the correct sequence and context for applying statistical analyses. One of the central issues identified in contemporary literature concerns the timing of data analysis relative to the synthetic data generation process. Some studies conduct statistical or ML analyses directly on real datasets and use synthetic data for validation or augmentation. Others, conversely, perform all stages of analysis, including feature importance estimation, correlation assessment, and model training, on synthetic data. This inconsistency raises the question of whether statistical analysis conducted solely on synthetic datasets yields reliable insights, or whether it constitutes a methodological flaw. The prevailing assumption is that analysis should ideally be performed on real data to preserve statistical integrity, but empirical evaluation of this notion remains limited. In the current study, the authors address this issue by applying a synthetic data generation method, specifically the Tabular Variational Autoencoder (TVAE), to a biomedical dataset focused on bladder cancer recurrence. This dataset includes various diagnostic variables, and the primary goal is to assess how well synthetic data replicates analytical insights drawn from the original data.
To achieve this, the authors conduct both correlational analysis and machine learning-based feature importance estimation. The results derived from synthetic datasets of varying sizes are then compared to those obtained from the original data. The findings indicate that while synthetic data can approximate general trends observed in the original dataset, there are notable differences depending on the analytical technique employed. In particular, models such as Random Forest appear more sensitive to variations introduced during the synthetization process. This sensitivity manifests as shifts in feature importance rankings and variability in predictive performance, especially when working with smaller synthetic datasets. On the other hand, simpler statistical methods such as correlation coefficients display more stability, suggesting that some analytical approaches may be more robust to data generation artifacts than others. These observations underscore the importance of methodological caution when interpreting results based on synthetic biomedical data. While synthetic datasets hold considerable promise for advancing data-driven research in biomedicine, they are not a one-size-fits-all solution. The sequence in which synthetic data is introduced into the research pipeline, whether before or after statistical analysis, can significantly influence the validity of the findings. As such, researchers must critically assess the suitability of synthetic data for specific analytical tasks and ensure transparency in reporting their methodological choices. Future work should further explore the impact of different generative models and dataset properties on the reliability of synthetic-data-driven insights.

  • Research Article
  • Cite Count: 4
  • 10.3171/2025.4.focus25225
Synthetic neurosurgical data generation with generative adversarial networks and large language models: an investigation on fidelity, utility, and privacy.
  • Jul 1, 2025
  • Neurosurgical focus
  • Austin A Barr + 3 more

Use of neurosurgical data for clinical research and machine learning (ML) model development is often limited by data availability, sample sizes, and regulatory constraints. Synthetic data offer a potential solution to challenges associated with accessing, sharing, and using real-world data (RWD). The aim of this study was to evaluate the capability of generating synthetic neurosurgical data with a generative adversarial network and large language model (LLM) to augment RWD, perform secondary analyses in place of RWD, and train an ML model to predict postoperative outcomes. Synthetic data were generated with a conditional tabular generative adversarial network (CTGAN) and the LLM GPT-4o based on a real-world neurosurgical dataset of 140 older adults who underwent neurosurgical interventions. Each model was used to generate datasets at equivalent (n = 140) and amplified (n = 1000) sample sizes. Data fidelity was evaluated by comparing univariate and bivariate statistics to the RWD. Privacy evaluation involved measuring the uniqueness of generated synthetic records. Utility was assessed by: 1) reproducing and extending clinical analyses on predictors of Karnofsky Performance Status (KPS) deterioration at discharge and a prolonged postoperative intensive care unit (ICU) stay, and 2) training a binary ML classifier on amplified synthetic datasets to predict KPS deterioration on RWD. Both the CTGAN and GPT-4o generated complete, high-fidelity synthetic tabular datasets. GPT-4o matched or exceeded CTGAN across all measured fidelity, utility, and privacy metrics. All significant clinical predictors of KPS deterioration and prolonged ICU stay were retained in the GPT-4o-generated synthetic data, with some differences observed in effect sizes. Preoperative KPS was not preserved as a significant predictor in the CTGAN-generated data. 
The ML classifier trained on GPT-4o data outperformed the model trained on CTGAN data, achieving a higher F1 score (0.725 vs 0.688) for predicting KPS deterioration. This study demonstrated a promising ability to produce high-fidelity synthetic neurosurgical data using generative models. Synthetic neurosurgical data present a potential solution to critical limitations in data availability for neurosurgical research. Further investigation is necessary to enhance synthetic data utility for secondary analyses and ML model training, and to evaluate synthetic data generation methods across other datasets, including clinical trial data.

  • Preprint Article
  • 10.2196/preprints.71364
Synthetic Data in Child and Adolescent Mental Health Service Research: A Tool Whose Time has Come. (Preprint)
  • Jan 16, 2025
  • Mounir Haizoune

BACKGROUND High-quality, large-scale healthcare research, especially research using medical records, encounters significant challenges related to technical difficulties and confidentiality issues. As a result, critical research questions about patient evaluation and treatment have been left unanswered. Moreover, the presence of stigma and increased sensitivity surrounding mental health issues has resulted in a significant delay in research progress, particularly concerning Child and Adolescent Mental Health Services (CAMHS). OBJECTIVE These challenges can be effectively addressed by generating synthetic data, which not only safeguard individual privacy but also facilitate comprehensive analyses of clinical information from EMRs and other clinical data sources. To exemplify this method, we have utilized CAMHS synthetic data for planning the allocation of mental health resources, while ensuring confidentiality. In the process, using mental health clinical data, we demonstrate how to create and successfully analyse synthetic data from large-scale EMR-based data to answer critical health care questions for policymakers and clinicians. METHODS The study was carried out on a retrospectively collected cohort comprising 6,924 distinct patients from the Child and Adolescent Mental Health Services (CAMHS) in Stavanger, Norway. The analysis included 7,730 referral periods and a total of 58,524 episodes of care. The full dataset was divided into a training cohort (n = 6,184 referrals, 58,524 episodes of care) and an independent, fixed test set (n = 1,564 referrals, 14,610 episodes of care). A hierarchical synthetic data generation model was used to generate synthetic referral periods with the associated episodes of care based on “real-world” CAMHS data. In addition to the utility of the data, the quality and privacy risk of the generated synthetic data were assessed.
RESULTS The synthetic hierarchical data generation model created reproducible synthetic CAMHS data with properties very similar to “real-world” data (KS/TVD Complement score = 0.92, CS score = 0.77, CS (Inter-table) score = 0.75 and CSS score = 0.92), while demonstrating a low risk score when exposed to a set of privacy attacks (average Singleout score (univariate) = 0.17, average Singleout score (multivariate) = 0.04, average Linkability risk = 2.5, average inference risk = 0.7). The predictive model trained on synthetic data produced comparable performance to the model trained on real data in the context of classifying the intensity of care required by patients, all while maintaining the interpretability of the utilized features (for n = 656, 1546, 3092 and 6184, average PR_AUC = 0.32, 0.33, 0.34 and 0.40 respectively, compared to PR_AUC = 0.43 when using n = 6184 real data records). CONCLUSIONS Synthetic data in Child and Adolescent Mental Health Services (CAMHS) balances data utility with fairness and privacy protection. It fosters trust between patients and healthcare providers while promoting collaboration among researchers by offering access to extensive and representative samples with a low risk of patient identification. This approach not only encourages data sharing but also expands the breadth of research while safeguarding patient privacy. Effective implementation of synthetic data generation methods in CAMHS depends on the model's ability to accurately identify and replicate the complex patterns present in real data, while maintaining consistency across various outputs.
Therefore, selecting the appropriate technique is crucial for achieving accurate and insightful research findings in this field. CLINICALTRIAL The synthetic hierarchical data generation model created reproducible synthetic CAMHS data with properties very similar to “real-world” data (for n = 656, KS/TVD Complement score = 0.92, CS score = 0.77, CS (Inter-table) score = 0.75 and CSS score = 0.92), while demonstrating a low risk score when exposed to a set of privacy attacks (for n = 656, average Singleout score (univariate) = 0.17, average Singleout score (multivariate) = 0.04, average Linkability risk = 2.5, average inference risk = 0.7). The predictive model trained on synthetic data produced comparable performance to the model trained on real data in the context of classifying the intensity of care required by patients, all while maintaining the interpretability of the utilized features (for n = 656, 1546, 3092 and 6184, average PR_AUC = 0.32, 0.33, 0.34 and 0.40 respectively, compared to PR_AUC = 0.43 when using n = 6184 real data records).

  • Research Article
  • Cite Count: 24
  • 10.1200/cci.23.00116
Evaluating the Utility and Privacy of Synthetic Breast Cancer Clinical Trial Data Sets
  • Sep 1, 2023
  • JCO Clinical Cancer Informatics
  • Samer El Kababji + 15 more

PURPOSE There is strong interest from patients, researchers, the pharmaceutical industry, medical journal editors, funders of research, and regulators in sharing clinical trial data for secondary analysis. However, data access remains a challenge because of concerns about patient privacy. It has been argued that synthetic data generation (SDG) is an effective way to address these privacy concerns. There is a dearth of evidence supporting this on oncology clinical trial data sets, and on the utility of privacy-preserving synthetic data. The objective of the proposed study is to validate the utility and privacy risks of synthetic clinical trial data sets across multiple SDG techniques. METHODS We synthesized data sets from eight breast cancer clinical trial data sets using three types of generative models: sequential synthesis, conditional generative adversarial network, and variational autoencoder. Synthetic data utility was evaluated by replicating the published analyses on the synthetic data and assessing concordance of effect estimates and CIs between real and synthetic data. Privacy was evaluated by measuring attribution disclosure risk and membership disclosure risk. RESULTS Utility was highest using the sequential synthesis method where all results were replicable and the CI overlap most similar or higher for seven of eight data sets. Both types of privacy risks were low across all three types of generative models. DISCUSSION Synthetic data using sequential synthesis methods can act as a proxy for real clinical trial data sets, and simultaneously have low privacy risks. This type of generative model can be one way to enable broader sharing of clinical trial data.

  • Research Article
  • Cite Count: 2
  • 10.1371/journal.pone.0321452
Benchmarking Differential Abundance Tests for 16S microbiome sequencing data using simulated data based on experimental templates.
  • May 19, 2025
  • PloS one
  • Eva Kohnert + 1 more

Differential abundance (DA) analysis of metagenomic microbiome data is essential for understanding microbial community dynamics across various environments and hosts. Identifying microorganisms that differ significantly in abundance between conditions (e.g., health vs. disease) is crucial for insights into environmental adaptations, disease development, and host health. However, the statistical interpretation of microbiome data is challenged by inherent sparsity and compositional nature, necessitating tailored DA methods. This benchmarking study aims to simulate synthetic 16S microbiome data using metaSPARSim (Patuzzi I, Baruzzo G, Losasso C, Ricci A, Di Camillo B. MetaSPARSim: a 16S rRNA gene sequencing count data simulator. BMC Bioinformatics. 2019;20:416. https://doi.org/10.1186/s12859-019-2882-6 PMID: 31757204), MIDASim (He M, Zhao N, Satten GA. MIDASim: a fast and simple simulator for realistic microbiome data. Available from: https://doi.org/10.1101/2023.03.23.533996), and sparseDOSSA2 (Ma S, Ren B, Mallick H, Moon YS, Schwager E, Maharjan S, et al. A statistical model for describing and simulating microbial community profiles. PLOS Comput Biol. 2021;17(9):e1008913. https://doi.org/10.1371/journal.pcbi.1008913 PMID: 34516542), leveraging 38 real-world experimental templates (S3 Table) previously utilized in a benchmark study comparing DA tools. These datasets, drawn from diverse environments such as human gut, soil, and marine habitats, serve as the foundation for our simulation efforts. We employ the same 14 DA tests that were previously used with the same experimental data in benchmark studies alongside 8 DA tests that were developed subsequently. Initially, we will generate synthetic data closely mirroring the experimental datasets, incorporating a known truth to cover a broad range of real-world data characteristics. This approach allows us to assess the ability of DA methods to recover known true differential abundances.
We will further simulate datasets by altering sparsity, effect size, and sample size, thus creating a comprehensive collection for applying the 22 DA tests. The outcomes, focusing on sensitivities and specificities, will provide insights into the performance of DA tests and their dependencies on sparsity, effect size, and sample size. Additionally, we will calculate data characteristics (S1 and S2 Table) for each simulated dataset and use a multiple regression to identify informative data characteristics influencing test performance. Our prior study, where we used simulated data without incorporating a known truth, demonstrated the feasibility of using synthetic data to validate experimental findings. This current study aims to enhance our understanding by systematically evaluating the impact of known truth incorporation on DA test performance, thereby providing further information for the selection and application of DA methods in microbiome research.
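
When a simulation carries a known truth, sensitivity and specificity of a DA test follow directly from comparing the features it calls significant with the truly differential ones. The sketch below illustrates that bookkeeping only; the function name and the feature labels are hypothetical and are not taken from the study.

```python
def sensitivity_specificity(all_features, truly_da, called_da):
    """Compare called features against a simulated ground truth.

    all_features: every feature (taxon) in the simulated dataset
    truly_da:     features simulated as differentially abundant
    called_da:    features a DA test reported as significant
    """
    all_features = set(all_features)
    truly_da = set(truly_da) & all_features
    called_da = set(called_da) & all_features

    tp = len(truly_da & called_da)          # correctly called
    fn = len(truly_da - called_da)          # missed true signals
    fp = len(called_da - truly_da)          # false discoveries
    tn = len(all_features) - tp - fn - fp   # correctly not called

    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical example with ten taxa, four of them truly differential.
features = [f"taxon_{i}" for i in range(1, 11)]
truth = {"taxon_1", "taxon_2", "taxon_3", "taxon_4"}
called = {"taxon_1", "taxon_2", "taxon_3", "taxon_5"}
sens, spec = sensitivity_specificity(features, truth, called)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```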

  • Research Article
  • Cite Count: 4
  • 10.1002/pds.70019
Validation Assessment of Privacy-Preserving Synthetic Electronic Health Record Data: Comparison of Original Versus Synthetic Data on Real-World COVID-19 Vaccine Effectiveness.
  • Oct 1, 2024
  • Pharmacoepidemiology and drug safety
  • Echo Wang + 5 more

To assess the validity of privacy-preserving synthetic data by comparing results from synthetic versus original EHR data analysis. A published retrospective cohort study on real-world effectiveness of COVID-19 vaccines by Maccabi Healthcare Services in Israel was replicated using synthetic data generated from the same source, and the results were compared between synthetic versus original datasets. The endpoints included COVID-19 infection, symptomatic COVID-19 infection and hospitalization due to infection and were also assessed in several demographic and clinical subgroups. In comparing synthetic versus original data estimates, several metrics were used: standardized mean differences (SMD), decision agreement, estimate agreement, confidence interval overlap, and Wald test. Synthetic data were generated five times to assess the stability of results. The distribution of demographic and clinical characteristics demonstrated very small differences (< 0.01 SMD). In the comparison of vaccine effectiveness assessed in relative risk reduction between synthetic versus original data, there was a 100% decision agreement, 100% estimate agreement, and a high level of confidence interval overlap (88.7%-99.7%) in all five replicates across all subgroups. Similar findings were achieved in the assessment of vaccine effectiveness against symptomatic COVID-19 infection. In the comparison of hazard ratios for COVID-19-related hospitalization and odds ratios for symptomatic COVID-19 infection, the Wald tests suggested no significant difference between respective effect estimates in all five replicates for all patient subgroups, but there were disagreements in estimate and decision metrics in some subgroups and replicates. Overall, comparison of synthetic versus original real-world data demonstrated good validity and reliability. Transparency on the process to generate high fidelity synthetic data and assurances of patient privacy are warranted.
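
Confidence interval overlap can be computed in several ways, and the abstract above does not state which definition was used. The sketch below shows one commonly used form, which averages the shared interval length relative to each interval's own length; the function name and the example intervals are hypothetical.

```python
def ci_overlap(original_ci, synthetic_ci):
    """Average relative overlap of two confidence intervals.

    Each argument is a (lower, upper) tuple. The result is 1.0 for
    identical intervals and becomes negative when the intervals do
    not intersect at all. This is one common definition, assumed
    here for illustration.
    """
    lo_o, hi_o = original_ci
    lo_s, hi_s = synthetic_ci
    # Length of the intersection (negative if the intervals are disjoint).
    intersection = min(hi_o, hi_s) - max(lo_o, lo_s)
    return 0.5 * (intersection / (hi_o - lo_o)
                  + intersection / (hi_s - lo_s))


# Hypothetical intervals for an effect estimate on original
# and synthetic data.
print(ci_overlap((0.0, 2.0), (1.0, 3.0)))
```

Some authors truncate negative values to zero, so that disjoint intervals simply count as no overlap.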

  • Research Article
  • Cite Count: 39
  • 10.1186/s12874-023-01869-w
A method for generating synthetic longitudinal health data
  • Mar 23, 2023
  • BMC Medical Research Methodology
  • Lucy Mosquera + 11 more

Getting access to administrative health data for research purposes is a difficult and time-consuming process due to increasingly demanding privacy regulations. An alternative method for sharing administrative health data would be to share synthetic datasets where the records do not correspond to real individuals, but the patterns and relationships seen in the data are reproduced. This paper assesses the feasibility of generating synthetic administrative health data using a recurrent deep learning model. Our data come from 120,000 individuals in Alberta Health’s administrative health database. We assess how similar our synthetic data is to the real data using utility assessments that examine the structure and general patterns in the data, as well as by recreating a specific analysis commonly applied to this type of administrative health data. We also assess the privacy risks associated with the use of this synthetic dataset. Generic utility assessments used Hellinger distance to quantify the difference in distributions between real and synthetic datasets for event types (0.027), attributes (mean 0.0417), Markov transition matrices (order 1 mean absolute difference: 0.0896, sd: 0.159; order 2: mean Hellinger distance 0.2195, sd: 0.2724), and the joint distributions (0.352). Random cohorts generated from real and synthetic data had a mean Hellinger distance of 0.3 and a mean Euclidean distance of 0.064, indicating small differences between the distributions in the real data and the synthetic data. When a realistic analysis was applied to both real and synthetic datasets, Cox regression hazard ratios achieved a mean confidence interval overlap of 68% for adjusted hazard ratios among 5 key outcomes of interest, indicating that synthetic data produces similar analytic results to real data. The privacy assessment concluded that the attribution disclosure risk associated with this synthetic dataset was substantially less than the typical 0.09 acceptable risk threshold. Based on these metrics, our results show that our synthetic data is suitably similar to the real data and could be shared for research purposes, thereby alleviating concerns associated with the sharing of real data in some circumstances.
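
For readers unfamiliar with the utility metric named in the abstract above, the following is a minimal sketch of the Hellinger distance between two discrete distributions, using the standard definition H(P, Q) = sqrt(0.5 * sum((sqrt(p_i) - sqrt(q_i))^2)). The function names and the event-type counts are hypothetical and chosen for illustration only; they are not taken from the paper, and this is not the authors' implementation.

```python
import math


def normalize(counts):
    """Convert non-negative counts into a probability distribution."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive value")
    return [c / total for c in counts]


def hellinger(p, q):
    """Hellinger distance between two discrete distributions.

    p and q are equal-length sequences of probabilities over the same
    categories. The result lies in [0, 1]: 0 means identical
    distributions, 1 means the distributions share no support.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2)


# Hypothetical event-type counts (invented for illustration only).
real_counts = [500, 300, 200]
synthetic_counts = [480, 320, 200]

distance = hellinger(normalize(real_counts), normalize(synthetic_counts))
print(f"Hellinger distance: {distance:.4f}")
```

Values close to 0, like those reported for event types and attributes, indicate that the synthetic distribution closely tracks the real one.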

  • Conference Article
  • Cited by 21
  • 10.1109/cvprw50498.2020.00266
Sensor-realistic Synthetic Data Engine for Multi-frame High Dynamic Range Photography
  • Jun 1, 2020
  • Jinhan Hu + 7 more

Deep learning-based mobile imaging applications are often limited by the lack of training data. To this end, researchers have resorted to using synthetic training data. However, pure synthetic data does not accurately mimic the distribution of the real data. To improve the utility of synthetic data, we present a systematic pipeline that takes synthetic data coming purely from a game engine and then produces synthetic data with real sensor characteristics such as noise and color gamut. We validate the utility of our sensor-realistic synthetic data for multi-frame high dynamic range (HDR) photography using a Samsung Galaxy S10 Plus smartphone. The results of training two baseline neural networks using our sensor-realistic synthetic data modeled for the S10 Plus show that our sensor-realistic synthetic data improves the quality of HDR photography on the modeled device. The synthetic dataset is publicly available at https://github.com/nadir-zeeshan/sensor-realistic-synthetic-data.

  • Research Article
  • 10.1108/jqme-03-2025-0020
Synthetic maintenance data generation for industrial assets based on historic statistical distribution using pseudo-random algorithm
  • Jan 21, 2026
  • Journal of Quality in Maintenance Engineering
  • Sebastian Diaz Vivas + 3 more

Purpose: The article aims to address the challenge of partial or complete absence of maintenance data records for industrial assets by generating synthetic maintenance data under a high-quality maintenance data structure established in the framework of International Organization for Standardization (ISO) 14224:2016. This contributes to maintenance engineering a strategy for obtaining meaningful synthetic data for maintenance management analysis without exposing industrial assets to failures that may lead to undesired consequences.
Design/methodology/approach: The research was conducted as an experimental study aimed at generating synthetic maintenance data from historical statistical distributions of industrial assets. For experimental purposes, based on the criticality of the studied process context, the research was carried out on a centrifugal pump, with its primary data source being the Offshore Reliability Data Handbook (OREDA), from which the four failure modes with the highest failure rate and the non-maintainable components related to the failure rate by probability were selected. The data were processed using Python 3.10.12, following a methodology of standardizing the data structure, for which a pseudo-code was established.
Findings: The article addresses the generation of synthetic maintenance data using historical statistical distributions from the OREDA. Two sets of synthetic data were obtained for a centrifugal pump, with the second set maintaining originality by defining the maximum failure rate as the mean of the global failure rate based on accurate data, demonstrated with an error of 1.96%. This approach allows for objective decision-making when forecasting different scenarios, as the synthetic data set acquires its dynamics dependent on the statistical distribution of the failure rate by failure modes, evidenced by the error in the standard deviation.
Originality/value: The article focuses on generating synthetic maintenance data by developing an algorithm based on internationally recognized statistical distributions aligned with the international standards of ISO 14224:2016. This approach aims to create a synthetic maintenance dataset with maintenance records from which maintenance variables and indicators can be derived. These derived insights enable maintenance optimization through data-driven decision-making feedback loops.

  • Research Article
  • Cited by 22
  • 10.1007/s11263-024-02102-x
Synthetic Data for Video Surveillance Applications of Computer Vision: A Review
  • May 17, 2024
  • International Journal of Computer Vision
  • Rita Delussu + 2 more

In recent years, there has been a growing interest in synthetic data for several computer vision applications, such as automotive, detection and tracking, surveillance, medical image analysis and robotics. Early use of synthetic data was aimed at performing controlled experiments under the analysis by synthesis approach. Currently, synthetic data are mainly used for training computer vision models, especially deep learning ones, to address well-known issues of real data, such as manual annotation effort, data imbalance and bias, and privacy-related restrictions. In this work, we survey the use of synthetic training data focusing on applications related to video surveillance, whose relevance has rapidly increased in the past few years due to their connection to security: crowd counting, object and pedestrian detection and tracking, behaviour analysis, person re-identification and face recognition. Synthetic training data are even more interesting in this kind of application, to address further, specific issues arising, e.g., from typically unconstrained image or video acquisition conditions and cross-scene application scenarios. We categorise and discuss the existing methods for creating synthetic data, analyse the synthetic data sets proposed in the literature for each of the considered applications, and provide an overview of their effectiveness as training data. We finally discuss whether and to what extent the existing synthetic data sets mitigate the issues of real data, highlight existing open issues, and suggest future research directions in this field.

  • Research Article
  • Cited by 1
  • 10.1200/jco.2024.42.16_suppl.e13627
AI-generated synthetic clinical-genomic data for precision oncology research: Validation using a case study on lung adenocarcinoma.
  • Jun 1, 2024
  • Journal of Clinical Oncology
  • Brandon Theodorou + 6 more

e13627
Background: The analysis of genomic variants is crucial in precision oncology research, offering insights into cancer risks and progression, especially in diverse types such as lung adenocarcinoma (LUAD). However, such research often grapples with balancing patient privacy with the need for comprehensive, high-quality genomic datasets. Our project addresses this by creating synthetic clinical-genomic data, which maintains patient confidentiality and provides a rich resource for genomic cancer research.
Methods: Leveraging the GuardantINFORM database, which includes anonymized genomic data and structured payer claims, we focused on generating synthetic data for LUAD patient cohorts. This approach involves processing real patient data into a format compatible with Medisyn’s generative AI models, ensuring the synthetic data retains the original's statistical properties, and processing the output back into the original database structure and format. This method plays a crucial role in maintaining patient privacy and serves as a valuable tool for research by enabling the generation of realistic patients with desired properties on demand.
Results: Our synthetic data closely mirrors real-world genomic and claims variable distributions, evidenced by an R² of 0.994 between real and synthetic data along with comparable Oncoprints. Importantly, privacy tests show that patient confidentiality is effectively maintained despite this effective performance. The synthetic data's utility was then demonstrated in a study replicating real-world findings: LUAD patients with KRAS G12C in combination with STK11 mutations showed a significantly higher risk of early mortality. This underscores the potential of synthetic data in advancing cancer research.
Conclusions: This research offers a promising avenue for the cancer research community. By providing a method to share privatized, synthetic genomic data, which can be combined and generated on demand, we enable broader, more responsible data sharing. This approach protects patient privacy and offers a rich dataset for groundbreaking research, potentially accelerating advances in cancer diagnosis and treatment. [Table: see text]
