Generation of Multimodal Longitudinal Synthetic Data By Artificial Intelligence to Improve Personalized Medicine in Hematology
- Research Article
- 10.1182/blood-2025-396
- Nov 3, 2025
- Blood
Safe: A multimodal, scalable and clinically-oriented comprehensive framework for synthetic data validation in hematology
- Research Article
- 10.1158/1557-3265.aimachine-a058
- Jul 10, 2025
- Clinical Cancer Research
Background: Machine learning models require large, diverse datasets, which can be challenging to acquire, even more so for multimodal and paired histology data. Within the I3LUNG European-funded project (NCT05537922), we evaluated multimodal synthetic data generation as a solution to enable domain-specific pretraining and imputation in NSCLC patients treated with immunotherapy (IO) using multimodal data. Methods: Our two-stage method included multimodal data simulation and AI-enabled data evaluation. First, a cross-modal autoencoder jointly embedded histology foundation model features with key clinical features: PD-L1 expression, smoking status, baseline ECOG performance status, histologic subtype, gender, metastatic sites, progression and survival events, LDH, BMI, neutrophil-lymphocyte ratio (NLR), and progression-free survival (PFS). The joint latent space was sampled using a Gaussian Copula model to generate synthetic patients with coherent multimodal features. Second, to evaluate the fidelity of synthetic clinical representations, we trained deep neural network models using Cox proportional hazards endpoints on real and simulated data to predict PFS, validating on held-out real patient data. Additionally, we used HistoXGAN to generate paired histology tile images for each synthetic patient. Results: We analyzed NSCLC patients (N=1813) treated with immunotherapy from five centers, split into training (n=1630) and test (n=183) cohorts. The synthetic data matched the original distributions, with minimal differences in continuous features (t-test p > 0.05 and mean differences: BMI -1.72%, PFS -0.37%, LDH -9.18%) and categorical ones (chi-square p > 0.05 and maximum class proportion differences of 3.9%, 15.5%, and 1.2% for bone metastasis, PD-L1 expression, and smoking history, respectively). Models trained on synthetic data (N=1000) performed similarly to those trained on real data.
In validation, the Cox model trained on synthetic data achieved a c-index of 0.683, versus 0.679 for real data (0.6% relative difference). Both synthetic and real data identified consistent prognostic factors (HR [95% CI]): bone metastases (real: 2.36 [1.39-4.80], synthetic: 2.46 [1.39-3.77]), LDH (real: 1.60 [1.16-2.69], synthetic: 1.48 [1.22-2.39]), and liver metastases (real: 1.52 [1.24-3.69], synthetic: 1.46 [1.11-2.80]). Conclusions: Our multimodal synthetic data successfully captured complex multi-feature relationships predictive of PFS in NSCLC patients treated with IO. Synthetic data enables cross-institutional model development while increasing patient privacy, with minimal impact on model performance. This approach paves the way for data democratization, fostering rapid collaboration and mutual validation of AI algorithms. Citation Format: Hanna M. Hieromnimon, Vanja Miskovic, Matteo Sacco, Alberto Ferrarin, Laura Mazzeo, Andrea Spagnoletti, Monica Ganzinelli, Cecilia Silvestri, Leonardo Provenzano, Claudia Proto, Nir Peled, Enriqueta Felip, Helena Linardou, Martin Reck, Francesco Trovo, Giuseppe Lo Russo, Marina Chiara Garassino, Samantha J. Riesenfeld, Alexander T. Pearson, Arsela Prelaj. Multimodal generative AI jointly learns pathology and clinical data to synthesize a multinational lung cancer cohort [abstract]. In: Proceedings of the AACR Special Conference in Cancer Research: Artificial Intelligence and Machine Learning; 2025 Jul 10-12; Montreal, QC, Canada. Philadelphia (PA): AACR; Clin Cancer Res 2025;31(13_Suppl):Abstract nr A058.
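The abstract above compares models by their c-index on held-out real patients. As a rough illustration of what that number measures, the following is a minimal sketch of Harrell's concordance index in plain Python. The function name `concordance_index`, the toy inputs, and the choice to skip pairs with tied times are assumptions made for this example; the abstract does not say which estimator or tie handling the authors used.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable patient pairs ranked correctly by risk.

    times       -- observed follow-up time per patient
    events      -- 1 if progression/death was observed, 0 if censored
    risk_scores -- model output, higher means earlier expected event
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that patient `a` has the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # Assumption for this sketch: skip tied times, and skip pairs where
        # the shorter time is censored (the true ordering is unknown).
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0   # higher risk failed first: correct ranking
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5   # tied risk counts as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy example (made-up values, not data from the study).
times = [5, 10, 15, 20]
events = [1, 1, 0, 1]
risk = [0.9, 0.3, 0.2, 0.4]
print(concordance_index(times, events, risk))  # 0.8
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the reported 0.683 versus 0.679 indicates that the two models rank held-out patients about equally well.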
- Abstract
2
- 10.1182/blood-2022-168646
- Nov 15, 2022
- Blood
Synthetic Data Generation By Artificial Intelligence to Accelerate Translational Research and Precision Medicine in Hematological Malignancies
- Discussion
17
- 10.1088/2516-1091/acafbf
- Jan 1, 2023
- Progress in Biomedical Engineering
In silico trial methods promise to improve the path to market for both medicines and medical devices, targeting the development of products, reducing reliance on animal trials, and providing adjunct evidence to bolster regulatory submissions. In silico trials are only as good as the simulated data that underpin them. Consequently, the most difficult challenge when creating robust in silico models is often the generation of simulated measurements, or even virtual patients, that are representative of real measurements and patients. This article digests the current state of the art for generating synthetic patient data outside the context of in silico trials and outlines potential synergies to unlock the potential of in silico trials using virtual populations, by exploiting synthetic patient data to model effects on a more diverse and representative population. Synthetic data could be defined as artificial data that mimic the properties and relationships in real data. Recent advances in synthetic data generation methodologies have allowed for the generation of high-fidelity synthetic data that are both statistically and clinically indistinguishable from real patient data. Other experimental work has demonstrated that synthetic data generation methods can be used for selective sample boosting of underrepresented groups. This article will provide a brief outline of synthetic data generation approaches and discuss how evaluation frameworks developed to assess synthetic data fidelity and utility could be adapted to evaluate the similarity of virtual patients used for in silico trials to real patients. The article will then discuss outstanding challenges and areas for further research that would advance both synthetic data generation methods and in silico trial methods.
Finally, the article will also provide a perspective on what evidence will be required to facilitate wider acceptance of in silico trials for regulatory evaluation of medicines and medical devices, including implications for post marketing safety surveillance.
- Research Article
2
- 10.3233/shti240490
- Aug 22, 2024
- Studies in health technology and informatics
The sharing of personal health data is highly regulated due to privacy and security concerns. An alternative to sharing personal data is to share synthetic data, because ideally it should be impossible to reconstruct real personal data from synthetic data, which is called privacy. At the same time, the structure of the synthetic data should be as similar as possible to the structure of the real data to ensure that conclusions drawn from the synthetic data are also valid for the real data, which is called fidelity. Typically, there is a tradeoff between fidelity and privacy for synthetic health data. We study the fidelity and privacy of cancer data synthesized using generative machine learning approaches. To generate synthetic cancer data, we use variational autoencoders (VAEs), generative adversarial networks (GANs), and denoising diffusion probabilistic models (DDPMs). The tabular cancer registry data studied have nine categorical variables from breast cancer patients. We find that DDPMs generate synthetic cancer data with higher fidelity; that is, the structure of the synthetic data is more similar to the real cancer data than the data generated by VAEs and GANs. At the same time, synthetic cancer data from DDPMs pose a greater privacy risk because the data are more likely to reveal information from real patients than synthetic data from VAEs and GANs.
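The fidelity and privacy trade-off described above can be made concrete with two very simple measures for categorical tabular data. The sketch below is an illustration only: the abstract does not name the metrics the authors used, so total variation distance (for fidelity of one variable's marginal distribution) and the share of synthetic records that exactly duplicate a real record (as a crude privacy proxy) are assumptions, as are the function names and the toy values.

```python
from collections import Counter


def total_variation_distance(real_values, synthetic_values):
    """Fidelity of one categorical variable: 0 = identical marginals, 1 = disjoint."""
    real_counts = Counter(real_values)
    synth_counts = Counter(synthetic_values)
    categories = set(real_counts) | set(synth_counts)
    n_real, n_synth = len(real_values), len(synthetic_values)
    # Counter returns 0 for categories that are missing on one side.
    return 0.5 * sum(
        abs(real_counts[c] / n_real - synth_counts[c] / n_synth)
        for c in categories
    )


def exact_match_rate(real_records, synthetic_records):
    """Crude privacy proxy: share of synthetic rows identical to some real row."""
    real_set = set(map(tuple, real_records))
    hits = sum(tuple(row) in real_set for row in synthetic_records)
    return hits / len(synthetic_records)


# Toy example with hypothetical category labels.
real_stage = ["A", "A", "B", "B"]
synth_stage = ["A", "B", "B", "B"]
print(total_variation_distance(real_stage, synth_stage))  # 0.25

real_rows = [("T1", "N0"), ("T2", "N1")]
synth_rows = [("T1", "N0"), ("T3", "N0")]
print(exact_match_rate(real_rows, synth_rows))  # 0.5
```

With only a handful of categorical variables, exact matches can also arise by chance, so a nonzero match rate is a prompt for closer inspection rather than proof of leakage.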
- Research Article
2
- 10.1182/blood-2023-187521
- Nov 2, 2023
- Blood
Synthetic Histopathological Images Generation with Artificial Intelligence to Accelerate Research and Improve Clinical Outcomes in Hematology
- Research Article
1
- 10.69554/lqom5698
- Jun 1, 2024
- Journal of Data Protection & Privacy
This paper explores the potential applications of high-fidelity synthetic patient data in the context of healthcare research, including challenges and benefits. The paper starts by defining synthetic data, types of synthetic data and approaches to generating synthetic data. It then discusses the potential applications of synthetic data beyond its use as a privacy-enhancing technology, and current debates around whether synthetic data should be considered personal data and, therefore, should be subjected to privacy controls to minimise reidentification risks. This is followed by a discussion of privacy preservation approaches and privacy metrics that can be applied in the context of synthetic data. The paper includes a case study based on synthetic electronic healthcare record data from the Clinical Practice Research Datalink on how privacy concerns due to reidentification have been addressed in order to make this data available for research purposes. The authors conclude that synthetic data, particularly high-fidelity synthetic patient data, has the potential to add value over and above real data for public health and that it is possible to address privacy concerns to make synthetic data available via a combination of privacy measures applied during the synthetic data generation process and post-generation reidentification risk assessments as part of data protection impact assessments.
- Research Article
3
- 10.1097/sla.0000000000006871
- Aug 6, 2025
- Annals of surgery
This study aimed to assess artificial intelligence (AI)-based synthetic data (SD) generation technology in surgery, evaluating the accuracy of the generated data and comparing the derived outcomes with real-world data. Trials evaluating new surgical techniques face numerous challenges. SD can play a pivotal role in optimizing clinical trial design, but must be used alongside real-world data to ensure accuracy. Transanal transection and single-stapled anastomosis (TTSS) is a technique with the potential to decrease the anastomotic leak (AL) rate over the double-stapled (DS) technique, according to preliminary data. The original data set included consecutive patients undergoing minimally invasive total mesorectal excision for rectal cancer with DS or TTSS anastomosis between 2010 and 2024. An AI-based generative model was trained to create high-fidelity SD, implemented and tested in a clinical trial setting using the 90-day AL rate as a primary endpoint. We created a synthetic copy of the original cohort (n=653) using the real data to train the model and evaluate its performance using the Synthetic vAlidation FramEwork powered by Train. The comparison of synthetic and real data demonstrated high statistical fidelity, clinical utility, and privacy preservation. We conditionally generated a balanced cohort (n=1200) with an equal number of patients for both types of anastomoses and strong performances using Synthetic Validation Framework powered by Train. The SD analysis confirmed real data findings, showing a significantly lower AL rate in the TTSS cohort (P<0.0001). AI-generated SD showed a high fidelity in replicating the statistical properties and complexity of the clinical features observed in the real-world population, being a very promising tool to improve surgical research.
- Research Article
- 10.63282/3050-9246.ijetcsit-v4i4p111
- Jan 1, 2023
- International Journal of Emerging Trends in Computer Science and Information Technology
Demand for high-quality, privacy-compliant, and scalable test data has grown exponentially alongside AI-based applications and software testing needs. Traditional data collection techniques suffer from privacy issues, inadequate coverage of edge cases, and high effort costs. Synthetic data generation via generative models has become a viable solution to these challenges. The aim of the paper is to investigate how recent advances in generative models, such as Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and Diffusion Models, can be used to create synthetic test datasets that have statistical fidelity while also ensuring user privacy. The paper explains which architectural elements, training techniques, and validation techniques were employed in building such models, with special attention to maintaining data diversity and realism. The experimental findings indicate that modern generative models are capable of producing synthetic data that closely resembles the real-world distribution and can be used to substantially increase software test coverage, especially for edge cases and in areas where compliance is relevant, such as finance and healthcare. Moreover, combining these models with differential privacy mechanisms demonstrates that regulated and secure synthetic data pipelines are feasible. This paper highlights the advantages, challenges, and potential applications of generative models in synthetic data generation. These findings suggest that hybrid methods, which combine both synthetic and minimally obfuscated real data, are the most effective approach to strike a balance between realism, privacy, and practical usefulness in real-world testing situations.
- Research Article
1
- 10.1200/jco.2024.42.16_suppl.e13627
- Jun 1, 2024
- Journal of Clinical Oncology
e13627 Background: The analysis of genomic variants is crucial in precision oncology research, offering insights into cancer risks and progression, especially in diverse types such as lung adenocarcinoma (LUAD). However, such research often grapples with balancing patient privacy with the need for comprehensive, high-quality genomic datasets. Our project addresses this by creating synthetic clinical-genomic data, which maintains patient confidentiality and provides a rich resource for genomic cancer research. Methods: Leveraging the GuardantINFORM database, which includes anonymized genomic data and structured payer claims, we focused on generating synthetic data for LUAD patient cohorts. This approach involves processing real patient data into a format compatible with Medisyn’s generative AI models, ensuring the synthetic data retains the original's statistical properties, and processing the output back into the original database structure and format. This method plays a crucial role in maintaining patient privacy and serves as a valuable tool for research by enabling the generation of realistic patients with desired properties on demand. Results: Our synthetic data closely mirrors real-world genomic and claims variable distributions, evidenced by an R² of 0.994 between real and synthetic data along with comparable Oncoprints. Importantly, privacy tests show that patient confidentiality is effectively maintained despite this effective performance. The synthetic data's utility was then demonstrated in a study replicating real-world findings: LUAD patients with KRAS G12C in combination with STK11 mutations showed a significantly higher risk of early mortality. This underscores the potential of synthetic data in advancing cancer research. Conclusions: This research offers a promising avenue for the cancer research community.
By providing a method to share privatized, synthetic genomic data, which can be combined and generated on demand, we enable broader, more responsible data sharing. This approach protects patient privacy and offers a rich dataset for groundbreaking research, potentially accelerating advances in cancer diagnosis and treatment. [Table: see text]
- Abstract
7
- 10.1182/blood-2019-125252
- Nov 13, 2019
- Blood
Preoperative Anemia and Blood Transfusion Requirement during Hip Surgery: Synthetic and Real Patient Cohort Data
- Research Article
1
- 10.3934/aci.2024009
- Jan 1, 2024
- Applied Computing and Intelligence
The use of synthetic data could facilitate data-driven innovation across industries and applications. Synthetic data can be generated using a range of methods, from statistical modeling to machine learning and generative AI, resulting in datasets of different formats and utility. In the health sector, the use of synthetic data is often motivated by privacy concerns. As generative AI is becoming an everyday tool, there is a need for practice-oriented insights into the prospects and limitations of synthetic data, especially in the privacy sensitive domains. We present an interdisciplinary outlook on the topic, focusing on, but not limited to, the Finnish regulatory context. First, we emphasize the need for working definitions to avoid misplaced assumptions. Second, we consider use cases for synthetic data, viewing it as a helpful tool for experimentation, decision-making, and building data literacy. Yet the complementary uses of synthetic datasets should not diminish the continued efforts to collect and share high-quality real-world data. Third, we discuss how privacy-preserving synthetic datasets fall into the existing data protection frameworks. Neither the process of synthetic data generation nor synthetic datasets are automatically exempt from the regulatory obligations concerning personal data. Finally, we explore the future research directions for generating synthetic data and conclude by discussing potential future developments at the societal level.
- Research Article
- 10.1093/ijlit/eaag002
- Jan 12, 2026
- International Journal of Law and Information Technology
Artificial intelligence (AI) has the potential to transform healthcare, but this requires access to health data. Synthetic data generated through training machine learning models on real data offers a way to balance innovation and privacy protection. However, uncertainties in the practical classification of synthetic health data under the General Data Protection Regulation (GDPR) currently limit the possible benefits of synthetic data. Through a systematic analysis of relevant legal sources and an empirical study, this article explores whether synthetic data should be classified as personal data under the GDPR. The study investigates the residual identification risk by generating synthetic data and simulating inference attacks, challenging common perceptions of technical identification risk. The risk of identification depends on several factors. The findings suggest synthetic data are often likely anonymous since results of an attack cannot easily be verified. The legal analysis highlights uncertainties about what constitutes a ‘reasonably likely’ risk and a need to further investigate a threshold for accepted risk. To promote innovation, the study calls for clearer regulations to balance privacy protection with the advancement of AI in healthcare.
- Research Article
6
- 10.1097/hep.0000000000001299
- Mar 11, 2025
- Hepatology (Baltimore, Md.)
Clinical hepatology research often faces limited data availability, underrepresentation of minority groups, and complex data-sharing regulations. Synthetic data, artificially generated patient records designed to mirror real-world distributions, offers a potential solution. We hypothesized that diffusion models, a state-of-the-art generative technique, could produce synthetic liver transplant waitlist data from the United Network for Organ Sharing database that maintains statistical fidelity, replicates clinical correlations and survival patterns, and ensures robust privacy protection. Diffusion models were used to generate synthetic patient cohorts mirroring the United Network for Organ Sharing liver transplant waitlist database between the years 2019 and 2023. Statistical fidelity was assessed using maximum mean discrepancy (MMD) and Wasserstein distance, correlation analysis, and variable-level metrics. Clinical utility was evaluated by comparing transplant-free survival via Kaplan-Meier curves and the MELD score performance. Privacy was quantified using the Distance to Closest Record (DCR) and attribute disclosure risk assessments. The synthetic dataset was nearly indistinguishable from the original dataset (MMD=0.002, standardized Wasserstein distance <1.0), preserving clinically relevant correlations and survival patterns as evidenced by similar median survival times (110 vs. 101 days) and 5-year survival rates (22.2% vs. 22.8%). MELD-based 90-day mortality prediction was maintained (original AUC=0.839 vs. synthetic AUC=0.844). Privacy metrics indicated no identifiable patient matches, and mean DCR values ensured that synthetic individuals were not direct replicas of real patients. Artificial intelligence-generated synthetic data derived from diffusion models can faithfully replicate complex hepatology datasets, maintain key clinical signals, and ensure strong privacy safeguards.
This approach can help address data scarcity, enhance model generalizability, foster multi-institutional collaboration, and accelerate progress in hepatology research.
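Two of the metrics named in the abstract above, Distance to Closest Record (DCR) and maximum mean discrepancy (MMD), are simple enough to sketch in a few lines. The code below is a minimal illustration under stated assumptions: Euclidean distance for DCR, a Gaussian (RBF) kernel with a fixed `bandwidth` and the biased estimator for squared MMD, and numeric features that have already been put on a comparable scale. The abstract does not specify the authors' kernel, distance, or preprocessing, and the function names are hypothetical.

```python
import math


def distance_to_closest_record(synthetic_rows, real_rows):
    """For each synthetic row, the Euclidean distance to its nearest real row.

    A distance of 0 means the synthetic row duplicates a real record.
    """
    return [min(math.dist(s, r) for r in real_rows) for s in synthetic_rows]


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel; the bandwidth is an arbitrary choice in this sketch."""
    return math.exp(-math.dist(x, y) ** 2 / (2 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between two samples (0 = same sample)."""
    def mean_kernel(a, b):
        total = sum(rbf_kernel(p, q, bandwidth) for p in a for q in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2 * mean_kernel(xs, ys)


# Toy example with made-up two-feature rows.
real = [(0.0, 0.0), (10.0, 10.0)]
synthetic = [(3.0, 4.0)]
print(distance_to_closest_record(synthetic, real))  # [5.0]
print(mmd_squared(real, real))                      # 0.0
```

Both loops are quadratic in the number of rows, which is fine for a sketch but would need a nearest-neighbour index or subsampling for a registry-sized cohort.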
- Research Article
1
- 10.3233/shti250398
- May 15, 2025
- Studies in health technology and informatics
Synthetic data, generated using generative AI techniques, closely mimics the characteristics of real data while enhancing privacy for sensitive health data. This study evaluates synthetic tabular data based on fidelity and utility for predictive models. Fidelity is measured through univariate distribution and bivariate differential pairwise correlations, while utility is measured by comparing machine learning model performance trained on synthetic and real data. Results show highly similar model performance on synthetic and real data. We also explore the potential of using synthetic data for hyperparameter tuning. Our findings reveal a strong correlation between prediction accuracy on synthetic and real data, suggesting that hyperparameters optimized using synthetic data can be effectively applied to models trained on real datasets for optimal results.
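The bivariate fidelity check mentioned above (differences in pairwise correlations) can be sketched as follows. This is an illustration rather than the study's implementation: it assumes numeric columns, Pearson correlation, and the mean absolute difference as the summary; the names `pearson` and `pairwise_correlation_difference` and the toy columns are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def pairwise_correlation_difference(real_cols, synth_cols):
    """Mean absolute difference in Pearson correlation over all column pairs.

    Both arguments map column name -> list of values; 0 means the synthetic
    data reproduces every pairwise correlation of the real data.
    """
    names = sorted(real_cols)
    diffs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            real_r = pearson(real_cols[a], real_cols[b])
            synth_r = pearson(synth_cols[a], synth_cols[b])
            diffs.append(abs(real_r - synth_r))
    return sum(diffs) / len(diffs)


# Toy example: the synthetic data flips the sign of the age/bmi relationship,
# so the correlation goes from +1 to -1 and the difference is close to 2.
real = {"age": [1, 2, 3, 4], "bmi": [2, 4, 6, 8]}
synthetic = {"age": [1, 2, 3, 4], "bmi": [8, 6, 4, 2]}
print(pairwise_correlation_difference(real, synthetic))
```

Categorical columns would need a different association measure, and constant columns would cause a division by zero in `pearson`, so both cases are left out of this sketch.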