Background
In hematological malignancies, there is a growing demand for comprehensive real-world data, including clinical and genomic information, to build models that improve diagnosis, prognosis, and personalized treatment choice. However, collecting such information in large patient populations is challenging, and many patient privacy issues need to be accounted for. One approach that can circumvent these issues is the creation of synthetic data that capture the complexities of the original dataset (distributions, non-linear relationships, and noise) without including any real patient information.

Aims
1) Apply advanced synthetic data generation methods to real-world datasets of different hematological malignancies.
2) Develop a Synthetic Validation Framework to evaluate the quality of synthetic data and perform data augmentation.
3) Test the capability of synthetic data to accelerate translational research.

Methods
Here we implemented a conditional tabular Wasserstein generative adversarial network (GAN) architecture with gradient penalty to generate synthetic data. The use cases were different cohorts of patients with myelodysplastic syndrome (MDS) and acute myeloid leukemia (AML) with available clinical and molecular features. We created a Synthetic Validation Framework to evaluate the quality of the generated synthetic data: Clinical Synthetic Fitness (CSF) and Genomic Synthetic Fitness (GSF) scores were calculated as the average of the multiple metric tests adopted. Patients were stratified by Hierarchical Dirichlet (HD) clustering. Explainability analysis was carried out with the SHapley Additive exPlanations (SHAP) approach. Survival analyses were performed with Kaplan-Meier curves and Cox proportional hazards (CoxPH) models (the experimental plan is reported in Figure 1).

Results
We first created a synthetic copy of an MDS cohort (n=2,043) using all the real data to train the model. We compared synthetic vs.
real data, obtaining high fitness performance for both clinical and genomic features (CSF=93%; GSF=90%). HD clustering was then applied to define clusters capturing broad dependencies among genomic features, showing comparable results in synthetic vs. real data; SHAP analysis indicated that similar features drive patient classification in both datasets. Finally, synthetic patients had survival comparable to that of real ones; when applying the conventional scoring system (IPSS-R), the probability of survival in the 5 risk categories was comparable between synthetic and real data.

In the second experimental setting, we analyzed synthetic MDS datasets of different sizes generated with a model trained on a real dataset. Interestingly, when generating a synthetic augmented dataset (200%), we obtained high fitness performance for both clinical and genomic features (CSF=91%; GSF=89%). Moreover, performance showed a similar trend in a cohort of 1,002 patients with AML (CSF=92%; GSF=89%), providing evidence for high generalizability of the model across different clinical settings.

Finally, we investigated whether the generation of synthetic data can accelerate translational research in hematology. After the first publication on the clinical relevance of gene mutations in MDS (Leukemia 2014;28:241), it took several years to collect data in large patient populations to generate a molecular classification (JCO 2021;39:1223) and a prognostic score (IPSS-M, NEJM Evid 2022;1:7). Starting from the MDS cohort available in 2014 (n=944, Leukemia 2014;28:241), we generated a 300% augmented synthetic dataset. HD clustering was applied to the synthetic data to define genomic-based clinical entities, resulting in the identification of the same 8 subgroups described in a real cohort of 2,043 patients many years later. Moreover, we applied a CoxPH model to the synthetic dataset to generate a molecular prognostic score (IPSS-M_Syn).
The model was based on molecular features similar to those of the real IPSS-M and identified 6 risk categories in which the probability of survival was similar to that of the IPSS-M risk groups (Figure 2).

Conclusion
GAN-generated synthetic data recapitulate the statistical properties and complexity of clinical and genomic features in different hematological malignancies, reproduce reliable survival estimates, and allow effective data augmentation. The implementation of this technology appears to accelerate precision medicine research in hematology.
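
For reference, the Methods name a Wasserstein GAN with gradient penalty but do not state the training objective. A common textbook form of the critic loss is sketched below; the conditioning vector c, the penalty weight \lambda, and the interpolation scheme are assumptions about how such a model is usually set up, not details reported in this abstract.

```latex
% Critic (discriminator) loss for a conditional WGAN with gradient penalty.
% P_r: real data distribution, P_g: generator distribution,
% c: conditioning vector (assumed), \lambda: penalty weight (assumed).
L_D = \mathbb{E}_{\tilde{x} \sim P_g}\left[ D(\tilde{x} \mid c) \right]
    - \mathbb{E}_{x \sim P_r}\left[ D(x \mid c) \right]
    + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}
      \left[ \left( \lVert \nabla_{\hat{x}} D(\hat{x} \mid c) \rVert_2 - 1 \right)^2 \right],
\qquad
\hat{x} = \epsilon x + (1 - \epsilon)\tilde{x}, \quad \epsilon \sim U[0, 1]
```

The first two terms estimate the Wasserstein distance between real and generated samples; the third term pushes the critic's gradient norm toward 1 along straight lines between real and synthetic points.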
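
The Methods describe the CSF and GSF scores as the average of multiple metric tests, without listing the tests. The sketch below illustrates only the averaging idea. The two metrics chosen here (one minus the Kolmogorov-Smirnov statistic for continuous features, one minus the total variation distance for categorical features) are assumptions, as are the function names and the toy columns `hemoglobin` and `SF3B1`, whose values are invented for illustration.

```python
from bisect import bisect_right
from collections import Counter
from statistics import mean


def ks_similarity(real, synth):
    """1 minus the two-sample Kolmogorov-Smirnov statistic (1.0 = identical ECDFs)."""
    real_sorted, synth_sorted = sorted(real), sorted(synth)
    # The largest gap between the two empirical CDFs occurs at an observed value.
    gap = max(
        abs(bisect_right(real_sorted, v) / len(real_sorted)
            - bisect_right(synth_sorted, v) / len(synth_sorted))
        for v in set(real_sorted) | set(synth_sorted)
    )
    return 1.0 - gap


def tv_similarity(real, synth):
    """1 minus the total variation distance between category frequencies."""
    p, q = Counter(real), Counter(synth)
    tvd = 0.5 * sum(abs(p[k] / len(real) - q[k] / len(synth)) for k in set(p) | set(q))
    return 1.0 - tvd


def fitness_score(real, synth, continuous):
    """Average per-feature similarity, as a percentage (hypothetical score definition)."""
    scores = [
        ks_similarity(real[name], synth[name]) if name in continuous
        else tv_similarity(real[name], synth[name])
        for name in real
    ]
    return 100.0 * mean(scores)


# Toy data, invented for illustration only (not from the study cohorts).
real = {"hemoglobin": [8.0, 9.0, 10.0, 11.0], "SF3B1": [1, 0, 0, 1]}
synth = {"hemoglobin": [8.0, 9.0, 10.0, 11.0], "SF3B1": [1, 0, 0, 0]}
print(fitness_score(real, synth, continuous={"hemoglobin"}))  # 87.5
```

In this toy case the continuous feature matches exactly (similarity 1.0) and the mutation frequency differs by 25 percentage points (similarity 0.75), so the average is 87.5%. A score built this way weights every feature equally; the study's actual framework may weight or combine its tests differently.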