VR-Based Generation of Photorealistic Synthetic Data for Training Hand-Object Tracking Models
Supervised learning models for precise tracking of hand-object interactions (HOI) in 3D require large amounts of annotated training data. Moreover, it is not intuitive for non-experts to label 3D ground truth (e.g. 6DoF object pose) on 2D images. To address these issues, we present "blender-hoisynth", an interactive synthetic data generator based on the Blender software. Blender-hoisynth can scalably generate and automatically annotate visual HOI training data. Competing approaches usually generate synthetic HOI data completely without human input. While this may be beneficial in some scenarios, HOI applications inherently necessitate direct control over the HOIs as an expression of human intent. With blender-hoisynth, users can interact with objects via virtual hands using standard virtual reality hardware. The synthetically generated data are characterized by a high degree of photorealism and contain visually plausible and physically realistic videos of hands grasping objects and moving them around in 3D. To demonstrate the efficacy of our data generation, we replace large parts of the training data in the well-known DexYCB dataset with blender-hoisynth data and train a state-of-the-art HOI reconstruction model on it. We show that there is no significant degradation in model performance despite the data replacement.
- Preprint Article
- 10.2196/preprints.71364
- Jan 16, 2025
BACKGROUND High-quality, large-scale healthcare research, especially research using medical records, encounters significant challenges related to technical difficulties and confidentiality issues. As a result, critical research questions about patient evaluation and treatment have been left unanswered. Moreover, the stigma and heightened sensitivity surrounding mental health issues have significantly delayed research progress, particularly concerning Child and Adolescent Mental Health Services (CAMHS). OBJECTIVE These challenges can be effectively addressed by generating synthetic data, which not only safeguards individual privacy but also facilitates comprehensive analyses of clinical information from EMRs and other clinical data sources. To exemplify this method, we have used CAMHS synthetic data for planning the allocation of mental health resources while ensuring confidentiality. In the process, using mental health clinical data, we demonstrate how to create and successfully analyse synthetic data from large-scale EMR-based data to answer critical healthcare questions for policymakers and clinicians. METHODS The study was carried out on a retrospectively collected cohort comprising 6,924 distinct patients from the Child and Adolescent Mental Health Services (CAMHS) in Stavanger, Norway. The analysis included 7,730 referral periods and a total of 58,524 episodes of care. The full dataset was divided into a training cohort (n = 6,184 referrals, 43,914 episodes of care) and an independent, fixed test set (n = 1,546 referrals, 14,610 episodes of care). A hierarchical synthetic data generation model was used to generate synthetic referral periods with the associated episodes of care based on “real-world” CAMHS data. In addition to the utility of the data, the quality and privacy risk of the generated synthetic data were assessed.
RESULTS The synthetic hierarchical data generation model created reproducible synthetic CAMHS data with properties very similar to “real-world” data (KS/TVD Complement score = 0.92, CS score = 0.77, CS (inter-table) score = 0.75, and CSS score = 0.92), while demonstrating low risk scores when exposed to a set of privacy attacks (average Singleout score (univariate) = 0.17, average Singleout score (multivariate) = 0.04, average linkability risk = 2.5, average inference risk = 0.7). The predictive model trained on synthetic data produced performance comparable to the model trained on real data in the context of classifying the intensity of care required by patients, all while maintaining the interpretability of the utilized features (for n = 656, 1,546, 3,092 and 6,184, average PR_AUC = 0.32, 0.33, 0.34 and 0.40 respectively, compared to PR_AUC = 0.43 when using n = 6,184 real data records). CONCLUSIONS Synthetic data in Child and Adolescent Mental Health Services (CAMHS) balances data utility with fairness and privacy protection. It fosters trust between patients and healthcare providers while promoting collaboration among researchers by offering access to extensive and representative samples with a low risk of patient identification. This approach not only encourages data sharing but also expands the breadth of research while safeguarding patient privacy. Effective implementation of synthetic data generation methods in CAMHS depends on the model's ability to accurately identify and replicate the complex patterns present in real data, while maintaining consistency across various outputs.
Therefore, selecting the appropriate technique is crucial for achieving accurate and insightful research findings in this field.
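The KS/TVD complement reported in this abstract is a standard column-shape fidelity metric: one minus the largest gap between the empirical CDFs of a real and a synthetic column, so 1.0 means the two empirical distributions coincide. A minimal stdlib sketch of the idea, using toy data rather than anything from the study:

```python
def ecdf(sample, x):
    """Empirical CDF of `sample` evaluated at x."""
    return sum(v <= x for v in sample) / len(sample)

def ks_complement(real, synthetic):
    """1 - KS statistic: 1.0 means identical empirical distributions."""
    grid = sorted(set(real) | set(synthetic))
    ks = max(abs(ecdf(real, x) - ecdf(synthetic, x)) for x in grid)
    return 1.0 - ks

# Toy columns, purely illustrative.
real = [1, 2, 2, 3, 3, 3, 4, 5]
synth = [1, 2, 3, 3, 3, 4, 4, 5]
print(round(ks_complement(real, synth), 3))  # 0.875
```

Real evaluation suites compute this per column and average the scores across columns.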
- Abstract
1
- 10.1182/blood-2022-171057
- Nov 15, 2022
- Blood
Systematic Evaluation of Synthetic Panel Data Quality with an Application to Chronic Lymphocytic Leukemia
- Research Article
3
- 10.3390/s24092750
- Apr 25, 2024
- Sensors
Biometric authentication plays a vital role in many everyday applications, with increasing demands for reliability and security. However, the use of real biometric data for research raises privacy concerns and data scarcity issues. Synthetic biometric data have emerged as a promising way to address the resulting unbalanced representation and bias, as well as the limited availability of diverse datasets for developing and evaluating biometric systems. Methods for the parameterized generation of highly realistic synthetic data are emerging, and the quality metrics needed to prove that synthetic data can compare to real data remain open research tasks. We explore the generation of 3D synthetic face data using game engines' ability to produce varied, realistic virtual characters as an alternative for generating synthetic face data while maintaining reproducibility and ground truth, as opposed to other creation methods. While synthetic data offer several benefits, including improved resilience against data privacy concerns, we also address the limitations and challenges associated with their usage. Our work shows concurrent behavior when comparing semi-synthetic data, as a digital representation of a real identity, with their real datasets. Despite slightly asymmetrical performance in comparison with a larger database of real samples, promising performance in face data authentication is shown, which lays the foundation for further investigations with digital avatars and the creation and analysis of fully synthetic data. Future directions for improving synthetic biometric data generation and their impact on advancing biometrics research are discussed.
- Research Article
3
- 10.3171/2025.4.focus25225
- Jul 1, 2025
- Neurosurgical focus
Use of neurosurgical data for clinical research and machine learning (ML) model development is often limited by data availability, sample sizes, and regulatory constraints. Synthetic data offer a potential solution to challenges associated with accessing, sharing, and using real-world data (RWD). The aim of this study was to evaluate the capability of generating synthetic neurosurgical data with a generative adversarial network and large language model (LLM) to augment RWD, perform secondary analyses in place of RWD, and train an ML model to predict postoperative outcomes. Synthetic data were generated with a conditional tabular generative adversarial network (CTGAN) and the LLM GPT-4o based on a real-world neurosurgical dataset of 140 older adults who underwent neurosurgical interventions. Each model was used to generate datasets at equivalent (n = 140) and amplified (n = 1000) sample sizes. Data fidelity was evaluated by comparing univariate and bivariate statistics to the RWD. Privacy evaluation involved measuring the uniqueness of generated synthetic records. Utility was assessed by: 1) reproducing and extending clinical analyses on predictors of Karnofsky Performance Status (KPS) deterioration at discharge and a prolonged postoperative intensive care unit (ICU) stay, and 2) training a binary ML classifier on amplified synthetic datasets to predict KPS deterioration on RWD. Both the CTGAN and GPT-4o generated complete, high-fidelity synthetic tabular datasets. GPT-4o matched or exceeded CTGAN across all measured fidelity, utility, and privacy metrics. All significant clinical predictors of KPS deterioration and prolonged ICU stay were retained in the GPT-4o-generated synthetic data, with some differences observed in effect sizes. Preoperative KPS was not preserved as a significant predictor in the CTGAN-generated data. 
The ML classifier trained on GPT-4o data outperformed the model trained on CTGAN data, achieving a higher F1 score (0.725 vs 0.688) for predicting KPS deterioration. This study demonstrated a promising ability to produce high-fidelity synthetic neurosurgical data using generative models. Synthetic neurosurgical data present a potential solution to critical limitations in data availability for neurosurgical research. Further investigation is necessary to enhance synthetic data utility for secondary analyses and ML model training, and to evaluate synthetic data generation methods across other datasets, including clinical trial data.
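The F1 scores compared in this abstract are the harmonic mean of precision and recall. A quick stdlib helper showing the computation on invented confusion-matrix counts (not the study's actual counts):

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Invented counts for illustration; equivalent to F1 = 2*tp / (2*tp + fp + fn).
print(round(f1_score(tp=29, fp=10, fn=12), 3))
```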
- Abstract
2
- 10.1182/blood-2022-168646
- Nov 15, 2022
- Blood
Synthetic Data Generation By Artificial Intelligence to Accelerate Translational Research and Precision Medicine in Hematological Malignancies
- Research Article
- 10.37934/araset.53.1.237248
- Oct 7, 2024
- Journal of Advanced Research in Applied Sciences and Engineering Technology
Synthetic image data generation has gained popularity in computer vision and machine learning in recent years. The work introduces a technique for creating artificial image data by utilizing 3D files and rendering methods in Python and Blender. The technique employs BlenderProc, a rendering tool for generating artificial images, to efficiently create a substantial amount of data. The output of the method is saved in JSON format, containing COCO annotations of objects in the images, facilitating seamless integration with current machine-learning pipelines. The paper shows that the created synthetic data can be used to enhance object data during simulation. The method can enhance the accuracy and robustness of machine-learning models by modifying simulation parameters like lighting, camera position, and object orientation to create a variety of images. This is especially beneficial for applications that require significant amounts of labelled real-world data, which can be time-consuming and labour-intensive to obtain. The study addresses the constraints and potential biases of creating synthetic data, emphasizing the significance of verifying and assessing the generated data prior to its utilization in machine learning models. Synthetic data generation can be a valuable tool for improving the efficiency and effectiveness of machine learning and computer vision applications. However, it is crucial to thoroughly assess the potential limitations and biases of the generated data. This paper emphasizes the potential of synthetic data generation to enhance the accuracy and resilience of machine learning models, especially in scenarios with limited access to labelled real-world data. This paper introduces a method that efficiently produces substantial amounts of synthetic image data with COCO annotations, serving as a valuable resource for professionals in computer vision and machine learning.
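The COCO-format JSON output mentioned in this abstract has a fixed structure: `images`, `categories`, and `annotations` with bounding boxes. BlenderProc writes these files itself; the sketch below only illustrates the file layout, and the file name, category, and box values are invented:

```python
import json

# Minimal COCO-style annotation file, as a renderer might emit per scene.
coco = {
    "images": [{"id": 1, "file_name": "render_0001.png", "width": 640, "height": 480}],
    "categories": [{"id": 1, "name": "object_01"}],
    "annotations": [{
        "id": 1,
        "image_id": 1,
        "category_id": 1,
        "bbox": [120, 80, 200, 150],   # [x, y, width, height] in pixels
        "area": 200 * 150,
        "iscrowd": 0,
    }],
}

text = json.dumps(coco, indent=2)      # what gets written to disk
parsed = json.loads(text)              # what a training pipeline reads back
print(len(parsed["annotations"]))
```

Because the structure is standardized, downstream detection pipelines can consume such files without knowing anything about the renderer that produced them.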
- Research Article
- 10.37934/araset.62.1.158169
- Oct 14, 2024
- Journal of Advanced Research in Applied Sciences and Engineering Technology
Synthetic image data generation has gained popularity in computer vision and machine learning in recent years. The work introduces a technique for creating artificial image data by utilizing 3D files and rendering methods in Python and Blender. The technique employs BlenderProc, a rendering tool for generating artificial images, to efficiently create a substantial amount of data. The output of the method is saved in JSON format, containing COCO annotations of objects in the images, facilitating seamless integration with current machine-learning pipelines. The paper shows that the created synthetic data can be used to enhance object data during simulation. The method can enhance the accuracy and robustness of machine-learning models by modifying simulation parameters like lighting, camera position, and object orientation to create a variety of images. This is especially beneficial for applications that require significant amounts of labelled real-world data, which can be time-consuming and labour-intensive to obtain. The study addresses the constraints and potential biases of creating synthetic data, emphasizing the significance of verifying and assessing the generated data prior to its utilization in machine learning models. Synthetic data generation can be a valuable tool for improving the efficiency and effectiveness of machine learning and computer vision applications. However, it is crucial to thoroughly assess the potential limitations and biases of the generated data. This paper emphasizes the potential of synthetic data generation to enhance the accuracy and resilience of machine learning models, especially in scenarios with limited access to labelled real-world data. This paper introduces a method that efficiently produces substantial amounts of synthetic image data with COCO annotations, serving as a valuable resource for professionals in computer vision and machine learning.
- Research Article
7
- 10.1190/tle41060392.1
- Jun 1, 2022
- The Leading Edge
This paper discusses the generation of synthetic 3D seismic data for training neural networks to solve a variety of seismic processing, interpretation, and inversion tasks. Using synthetic data is a way to address the shortage of seismic data, which are required for solving problems with machine learning techniques. Synthetic data are built via a simulation process that is based on a mathematical representation of the physics of the problem. In other words, using synthetic data is an indirect way to teach neural networks about the physics of the problem. An important incentive for using synthetic data to solve problems with artificial intelligence methods is that with real seismic data the ground truth is always unknown. When generating synthetic seismic data, we first build the model and then calculate the data, so the answer (model) is always known and always exact. We describe a methodology for generating on-the-fly simulated postmigration (1D modeling) synthetic data in 3D, which are high resolution and look similar to real data. A wide range of models is covered by generating an unlimited number of data examples. The synthetic data are built from impedance models that are constructed through geostatistical simulation of real well logs. With geostatistical simulation, we can describe various geologic variance models in 3D and obtain realistic images. To cover a broad range of scenarios, we generalize the synthetic seismic data by randomly perturbing many parameters, including structures, conformity styles, dip-strike directions, variograms, measured input logs, frequencies, phase spectra, etc.
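The 1D (convolutional) modeling this abstract describes can be sketched in a few lines: reflection coefficients are computed from an impedance log and convolved with a wavelet to produce a synthetic trace. The layered model and wavelet parameters below are illustrative placeholders, not values from the paper:

```python
import math

def ricker(freq_hz, dt_s, half_len):
    """Ricker (Mexican hat) wavelet sampled at interval dt_s."""
    ts = [i * dt_s for i in range(-half_len, half_len + 1)]
    return [(1 - 2 * (math.pi * freq_hz * t) ** 2)
            * math.exp(-(math.pi * freq_hz * t) ** 2) for t in ts]

def reflectivity(impedance):
    """Reflection coefficients r = (Z2 - Z1) / (Z2 + Z1) along an impedance log."""
    return [(z2 - z1) / (z2 + z1) for z1, z2 in zip(impedance, impedance[1:])]

def convolve(signal, kernel):
    """Full discrete convolution: the synthetic seismic trace."""
    n, m = len(signal), len(kernel)
    return [sum(signal[j] * kernel[k - j]
                for j in range(max(0, k - m + 1), min(n, k + 1)))
            for k in range(n + m - 1)]

impedance = [2.0e6, 2.0e6, 3.5e6, 3.5e6, 2.8e6]   # illustrative layered model
refl = reflectivity(impedance)
trace = convolve(refl, ricker(freq_hz=30.0, dt_s=0.002, half_len=25))
print(len(trace))
```

The paper's pipeline wraps this kind of forward model in geostatistical simulation and random parameter perturbation to produce unlimited labeled examples.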
- Abstract
- 10.1182/blood-2024-209541
- Nov 5, 2024
- Blood
Generation of Multimodal Longitudinal Synthetic Data By Artificial Intelligence to Improve Personalized Medicine in Hematology
- Research Article
2
- 10.1109/jbhi.2024.3520156
- Feb 1, 2025
- IEEE journal of biomedical and health informatics
The limited availability of diverse, high-quality datasets is a significant challenge in applying deep learning to neuroimaging research. Although synthetic data generation can potentially address this issue, on-the-fly generation is computationally demanding, while training on pre-generated data is inflexible and may incur high storage costs. We introduce Wirehead, a scalable in-memory data pipeline that significantly improves the performance of on-the-fly synthetic data generation for deep learning in neuroimaging. Wirehead's architecture decouples data generation from training by running multiple generators in independent parallel processes, facilitating near-linear performance gains proportional to the number of generators used. It efficiently handles terabytes of data using MongoDB, greatly minimizing prohibitive storage costs. The robust, modular design enables flexible pipeline configurations and fault-tolerant operation. We evaluated Wirehead with SynthSeg, a synthetic brain segmentation data generation tool that requires 7 days to train a model. When deployed in parallel, Wirehead achieved a near-linear 15.7x increase in throughput with 16 generators. With 20 generators, we can train a model in 9 hours instead of 7 days. This demonstrates Wirehead's ability to greatly accelerate experimentation cycles. While Wirehead represents a substantial step forward, it also reveals opportunities for future research in optimizing generation-training balance and resource allocation. Its ability to facilitate distributed deep learning has significant implications for enabling more ambitious neuroimaging research.
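Wirehead's core idea, decoupling data generation from training by running independent generators that feed a shared buffer, can be illustrated with a stdlib sketch. Note this is only a schematic: the real system runs generators in separate processes and uses MongoDB as the buffer, whereas this sketch uses threads and an in-memory queue, and all names are hypothetical:

```python
import queue
import threading

def generator(worker_id, out_q, n_samples):
    """Stand-in for a synthetic-data generator running independently of training."""
    for i in range(n_samples):
        out_q.put((worker_id, i))          # a real pipeline would enqueue image/label pairs

def main():
    buffer = queue.Queue()                 # Wirehead uses MongoDB as this shared buffer
    workers = [threading.Thread(target=generator, args=(w, buffer, 10))
               for w in range(4)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()

    # "Training" loop: consume whatever the generators produced.
    consumed = 0
    while not buffer.empty():
        buffer.get()
        consumed += 1
    return consumed

print(main())  # 4 workers x 10 samples = 40
```

Because generators are independent, adding more of them scales throughput roughly linearly until the buffer or the trainer becomes the bottleneck, which matches the near-linear 15.7x speedup the paper reports for 16 generators.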
- Conference Article
- 10.54941/ahfe1006801
- Jan 1, 2025
- AHFE international
The application of synthetic data within the biomedical domain is rapidly gaining momentum, driven by the growing need for robust datasets suitable for machine learning (ML) and statistical modeling. In scenarios where access to real patient data is limited due to privacy concerns or scarcity, synthetic data offers an attractive alternative. These artificially generated datasets aim to mimic the statistical characteristics of original data, enabling researchers to conduct exploratory analysis, develop predictive models, or validate findings without compromising patient confidentiality. However, the increasing use of synthetic data raises several methodological and interpretative challenges, particularly regarding the correct sequence and context for applying statistical analyses. One of the central issues identified in contemporary literature concerns the timing of data analysis relative to the synthetic data generation process. Some studies conduct statistical or ML analyses directly on real datasets and use synthetic data for validation or augmentation. Others, conversely, perform all stages of analysis, including feature importance estimation, correlation assessment, and model training, on synthetic data. This inconsistency raises the question of whether statistical analysis conducted solely on synthetic datasets yields reliable insights, or whether it constitutes a methodological flaw. The prevailing assumption is that analysis should ideally be performed on real data to preserve statistical integrity, but empirical evaluation of this notion remains limited. In the current study, the authors address this issue by applying a synthetic data generation method, specifically the Tabular Variational Autoencoder (TVAE), to a biomedical dataset focused on bladder cancer recurrence. This dataset includes various diagnostic variables, and the primary goal is to assess how well synthetic data replicates analytical insights drawn from the original data.
To achieve this, the authors conduct both correlational analysis and machine learning-based feature importance estimation. The results derived from synthetic datasets of varying sizes are then compared to those obtained from the original data. The findings indicate that while synthetic data can approximate general trends observed in the original dataset, there are notable differences depending on the analytical technique employed. In particular, models such as Random Forest appear more sensitive to variations introduced during the synthetization process. This sensitivity manifests as shifts in feature importance rankings and variability in predictive performance, especially when working with smaller synthetic datasets. On the other hand, simpler statistical methods such as correlation coefficients display more stability, suggesting that some analytical approaches may be more robust to data generation artifacts than others. These observations underscore the importance of methodological caution when interpreting results based on synthetic biomedical data. While synthetic datasets hold considerable promise for advancing data-driven research in biomedicine, they are not a one-size-fits-all solution. The sequence in which synthetic data is introduced into the research pipeline, whether before or after statistical analysis, can significantly influence the validity of the findings. As such, researchers must critically assess the suitability of synthetic data for specific analytical tasks and ensure transparency in reporting their methodological choices. Future work should further explore the impact of different generative models and dataset properties on the reliability of synthetic-data-driven insights.
- Research Article
1
- 10.1093/ndt/gfad063c_5490
- Jun 14, 2023
- Nephrology Dialysis Transplantation
Background and Aims Synthetic data can be an effective supplement or alternative to real data for the training of machine learning models. Synthetic data may also be used to evaluate new tools, develop educational curricula, or remove undesirable biases in datasets. We aim to evaluate four synthetic data generation methods applied to hypertension randomized clinical trial data. Method The Systolic Blood Pressure Intervention Trial (SPRINT) showed that intensive BP control to SBP <120 mm Hg results in significant cardiovascular benefits in high-risk patients with hypertension compared with routine BP control to <140 mm Hg. The Synthetic Data Vault (SDV) is a synthetic data generation ecosystem of libraries that allows users to easily generate new synthetic data with the same format and statistical properties as the original dataset. SDV supports multiple types of data, including date-times, discrete-ordinal, categorical, and numerical. SPRINT data was pre-processed to create a single table of 140,000 patient visits with baseline variables (age, sex, race, aspirin use, estimated Glomerular Filtration Rate (eGFR)) and visit-level variables (systolic and diastolic blood pressure, heart rate, and total number of antihypertensive medications at end of visit). Using the SDV library for Python, we used four generative models to create synthetic SPRINT data: 1. a Gaussian copula model, 2. a conditional tabular generative adversarial network (CTGAN), 3. a CopulaGAN model, and 4. a Tabular Variational Autoencoder (TVAE). We evaluated the results using the SDMetrics library, which includes the shapes of the columns (marginal distributions), the pairwise trends between the columns (correlations), the ability to reproduce mathematical properties of the original data, and new row synthesis. Finally, an overall quality score representing an amalgamation of the marginal distributions and correlations was computed, where 0 indicates the lowest quality and 1 indicates the highest.
Results Two hundred thousand synthetic patient visits were created for each method. The overall quality scores in order were 90.67% for Gaussian copula, 86.77% for TVAE, 81.03% for CTGAN, and 79.7% for CopulaGAN. The column shape score, which represents the marginal distribution, was highest for Gaussian copula (94.54%), followed by TVAE (88.44%), CTGAN (82.35%), and CopulaGAN (80.27%). The column pair trend score, which corresponds to correlations, was highest for Gaussian copula (86.8%), followed by TVAE (85.1%), CTGAN (79.72%), and CopulaGAN (79.12%). Conclusion Gaussian copula created the highest-scoring synthetic SPRINT data based on the marginal distributions, correlations, and overall score. The Synthetic Data Vault is a feasible collection of methods for generating synthetic clinical trial data for training future machine learning and AI models.
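The Gaussian copula approach that scored highest here works by imposing a correlation structure in Gaussian space and mapping samples back through the target marginal distributions. A minimal two-variable stdlib sketch of that core idea, using illustrative rate-1 exponential marginals (SDV's synthesizer fits the marginals and correlations from data, which this sketch does not):

```python
import math
import random

random.seed(42)

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def sample_copula(rho, n):
    """Draw n pairs with Gaussian-copula dependence and exponential marginals."""
    out = []
    for _ in range(n):
        # Correlated standard normals with correlation rho.
        z1 = random.gauss(0, 1)
        z2 = rho * z1 + math.sqrt(1 - rho ** 2) * random.gauss(0, 1)
        # Map to uniforms, then invert the target marginals (rate-1 exponentials).
        u1, u2 = normal_cdf(z1), normal_cdf(z2)
        out.append((-math.log(1 - u1), -math.log(1 - u2)))
    return out

pairs = sample_copula(rho=0.8, n=1000)
print(len(pairs))
```

The same construction generalizes to many columns by replacing the two correlated normals with a multivariate normal draw and fitting each column's marginal from the real data.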
- Research Article
- 10.1093/jamiaopen/ooaf137
- Nov 3, 2025
- JAMIA Open
Objective: To evaluate the effectiveness of open-source generative models in producing high-quality tabular synthetic data using a Health and Demographic Surveillance System (HDSS) dataset from rural Kenya, as a proof of concept in a low- and middle-income country (LMIC) setting. Materials and Methods: Three open-source models (CTGAN, TableGAN, and CopulaGAN) were used to generate synthetic data from the Kaloleni/Rabai HDSS dataset. To assess the quality of the synthetic datasets generated by each model, we performed fidelity, utility, and privacy tests. Results: CTGAN outperformed the other models, producing synthetic data that closely mirrored the statistical properties of the real dataset while preserving privacy. Both CopulaGAN and TableGAN performed poorly, with TableGAN completely failing to generate realistic synthetic data. For the utility tests, Random Forest models trained on CTGAN-generated synthetic data achieved comparable performance to models trained on real data (accuracy: 72.4% vs 72.0%, P = .38; F1 score: 71.4% vs 68.3%, P = .22), indicating no statistically significant loss in predictive utility. The CTGAN model also yielded higher precision and recall than CopulaGAN, suggesting that the synthetic data generated by CTGAN better preserved the underlying structure of the real data. Discussion: CTGAN demonstrated superior performance in generating high-quality synthetic tabular HDSS data. CopulaGAN and TableGAN produced lower quality data, though these results may not generalize to other datasets. Conclusion: Synthetic data generation of tabular data using HDSS data, particularly via CTGAN, may enhance the accessibility of datasets in LMICs by creating synthetic datasets that preserve the characteristics and statistical properties of the original data, while upholding privacy and confidentiality.
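One simple check behind privacy evaluations of tabular generators like those above is whether synthetic rows are verbatim copies of real rows. A stdlib sketch with invented toy records (hypothetical fields, not the HDSS schema):

```python
def copied_row_fraction(real_rows, synthetic_rows):
    """Fraction of synthetic rows that exactly match some real row (lower is safer)."""
    real_set = {tuple(sorted(r.items())) for r in real_rows}
    hits = sum(tuple(sorted(r.items())) in real_set for r in synthetic_rows)
    return hits / len(synthetic_rows)

# Invented toy records for illustration.
real = [
    {"age": 34, "sex": "F", "visits": 2},
    {"age": 51, "sex": "M", "visits": 5},
]
synthetic = [
    {"age": 34, "sex": "F", "visits": 2},   # an exact copy: a privacy red flag
    {"age": 40, "sex": "M", "visits": 3},
    {"age": 29, "sex": "F", "visits": 1},
    {"age": 62, "sex": "M", "visits": 4},
]
print(copied_row_fraction(real, synthetic))  # 0.25
```

Full privacy suites go further, measuring near-duplicates and attack-based risks such as linkability and inference rather than only exact matches.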
- Research Article
- 10.1108/jqme-03-2025-0020
- Jan 21, 2026
- Journal of Quality in Maintenance Engineering
Purpose The article aims to address the challenge of partial or complete absence of maintenance data records for industrial assets by generating synthetic maintenance data under a high-quality maintenance data structure established in the framework of International Organization for Standardization (ISO) 14224:2016. This provides maintenance engineering with a strategy to obtain meaningful synthetic data for maintenance management analysis without exposing industrial assets to failures that may lead to undesired consequences. Design/methodology/approach The research was conducted as an experimental study aimed at generating synthetic maintenance data from historical statistical distributions of industrial assets. For experimental purposes, based on the criticality of the studied process context, the research was carried out on a centrifugal pump, with its primary data source the Offshore Reliability Data Handbook (OREDA), from which the four failure modes with the highest failure rates, and the non-maintainable components related to each failure rate by probability, were selected. The data were processed in Python 3.10.12, following a methodology of standardizing the data structure, for which pseudo-code was established. Findings The article addresses the generation of synthetic maintenance data using historical statistical distributions from OREDA. Two sets of synthetic data were obtained for a centrifugal pump, with the second set maintaining originality by defining the maximum failure rate as the mean of the global failure rate based on accurate data, demonstrated with an error of 1.96%. This approach allows for objective decision-making when forecasting different scenarios, as the synthetic dataset acquires dynamics dependent on the statistical distribution of the failure rate by failure mode, evidenced by the error in the standard deviation.
Originality/value The article focuses on generating synthetic maintenance data by developing an algorithm based on internationally recognized statistical distributions aligned with the international standards of ISO 14224:2016. This approach aims to create a synthetic maintenance dataset with maintenance records from which maintenance variables and indicators can be derived. These derived insights enable maintenance optimization through data-driven decision-making feedback loops.
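The statistical core of generating synthetic maintenance records from failure-rate distributions, as described above, can be sketched as follows: pick a failure mode in proportion to its failure rate, then draw an exponential time-to-failure at that rate. This is only an illustration of the sampling step, not the paper's ISO 14224-structured pseudo-code, and the rates below are invented placeholders rather than OREDA values:

```python
import random

random.seed(7)

# Illustrative failure rates (failures per 10^6 operating hours), not OREDA values.
failure_rates = {
    "external_leakage": 4.1,
    "fail_to_start": 2.6,
    "low_output": 1.8,
    "vibration": 1.2,
}

def synthetic_events(rates, n):
    """Sample n maintenance records as (failure mode, time-to-failure in hours)."""
    modes = list(rates)
    weights = [rates[m] for m in modes]
    records = []
    for _ in range(n):
        mode = random.choices(modes, weights=weights)[0]
        rate_per_hour = rates[mode] / 1e6
        ttf = random.expovariate(rate_per_hour)  # exponential time-to-failure
        records.append((mode, ttf))
    return records

events = synthetic_events(failure_rates, 200)
print(len(events))
```

From such records, maintenance indicators (MTBF per failure mode, expected interventions per year) can be derived without waiting for real failures to occur.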
- Research Article
- 10.1158/1538-7445.am2019-1641
- Jul 1, 2019
- Cancer Research
While machine learning (ML) has shown some promise in medical research, its actual impact has been limited relative to other application domains. One reason for this disparity is the lack of high-quality, patient-level data available to the broader ML research community. Such datasets are often not made available due to protections around patient privacy. To overcome these obstacles, high-quality, synthetic datasets could be leveraged to accelerate methodological developments in the application of ML to biomedical research. Clinical data in the form of electronic health records present a rich data source to be used for synthetic data generation. Such data can be high dimensional and predominantly categorical, which poses multiple challenges from a modeling perspective. In this paper, we evaluate four classes of synthetic data generation techniques, as well as several metrics for evaluating the quality of the synthetic data. While the results and discussions are broadly applicable to medical data, for demonstration purposes we generate synthetic datasets from the publicly available Surveillance Epidemiology and End Results (SEER) program. Specifically, our cohort consists of breast cancer cases diagnosed in the year of 2010, which includes over 26000 individual cases. Finally, we discuss the trade-offs of the different methods and metrics, providing guidance on considerations for the generation and usage of synthetic medical data. Citation Format: Andre R. Goncalves, Priyadip Ray, Braden Soper, Madhumita Myneni, Jennifer L. Stevens, Linda M. Coyle, Ana Paula Sales. Generation and evaluation of medical synthetic data [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2019; 2019 Mar 29-Apr 3; Atlanta, GA. Philadelphia (PA): AACR; Cancer Res 2019;79(13 Suppl):Abstract nr 1641.