Data sharing is often hindered by concerns about patient privacy, regulatory requirements, and proprietary interests. Because obtaining large data sets is costly and time-consuming, these barriers impede scientific progress and establish a gatekeeping mechanism in clinical medicine. We employed two generative artificial intelligence (AI) technologies, CTAB-GAN+ and Normalizing Flows (NFlow), to synthesize clinical trial data based on pooled patient data from four previous multicenter clinical trials of the German Study Alliance Leukemia (AML96, AML2003, AML60+, SORAML) that enrolled adult patients (n=1606) with acute myeloid leukemia (AML) who received intensive induction therapy. As a generative adversarial network (GAN), CTAB-GAN+ consists of two adversarial networks: a generator producing synthetic samples from random noise and a discriminator aiming to distinguish between real and synthetic samples. The model converges when the discriminator can no longer reliably differentiate between real and synthetic data. In contrast, NFlow consists of a sequence of invertible transformations (flows) that start from a simple base distribution and gradually add complexity to better mirror the training data. Both models were trained on tabular data including demographic, laboratory, molecular genetic, and cytogenetic patient variables. Molecular alterations in the original cohort were detected via next-generation sequencing (NGS) using the TruSight Myeloid Sequencing Panel (Illumina, San Diego, CA, USA) with a 5% variant-allele frequency (VAF) mutation calling cut-off. For cytogenetics, standard techniques for chromosome banding and fluorescence in situ hybridization (FISH) were used. Hyperparameter tuning of the generative models was conducted using the Optuna framework. For each model, we ran 70 optimization trials to optimize a custom score inspired by TabSynDex, which assesses both the resemblance of the synthetic data to the real training data and its utility.
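The idea behind a normalizing flow described above (a simple base distribution pushed through a chain of invertible transformations) can be sketched in one dimension. This is a toy illustration only, not the NFlow model trained in this work: the layer classes `Affine` and `LeakyReLU`, their parameter values, and the helper names `sample` and `log_prob` are all assumptions made for this example.

```python
import math
import random


class Affine:
    """y = scale * x + shift; invertible as long as scale != 0."""

    def __init__(self, scale, shift):
        self.scale, self.shift = scale, shift

    def forward(self, x):
        return self.scale * x + self.shift

    def inverse(self, y):
        return (y - self.shift) / self.scale

    def log_abs_det(self, x):
        # log |dy/dx| at the layer input x (constant for an affine map)
        return math.log(abs(self.scale))


class LeakyReLU:
    """Strictly increasing piecewise-linear map; makes the density asymmetric."""

    def __init__(self, slope):
        self.slope = slope  # slope for negative inputs, must be > 0

    def forward(self, x):
        return x if x >= 0 else self.slope * x

    def inverse(self, y):
        return y if y >= 0 else y / self.slope

    def log_abs_det(self, x):
        return 0.0 if x >= 0 else math.log(self.slope)


def sample(flows, rng):
    """Draw from the standard normal base and push it through every flow."""
    x = rng.gauss(0.0, 1.0)
    for f in flows:
        x = f.forward(x)
    return x


def log_prob(flows, x):
    """Change of variables: log p(x) = log p_base(z) - sum_k log|df_k/dx_k|."""
    total_log_det = 0.0
    for f in reversed(flows):
        x = f.inverse(x)                # x is now the input of layer f
        total_log_det += f.log_abs_det(x)
    log_base = -0.5 * x * x - 0.5 * math.log(2 * math.pi)
    return log_base - total_log_det


# Hypothetical, hand-picked parameters; a real model would learn them.
flows = [Affine(1.5, 0.3), LeakyReLU(0.4), Affine(2.0, -1.0)]
rng = random.Random(0)
draws = [sample(flows, rng) for _ in range(5)]
```

Because every layer is invertible and has a tractable Jacobian, the same chain serves both to generate samples and to evaluate their exact likelihood, which is what distinguishes this approach from the adversarial training used by a GAN.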
Pairwise analyses were conducted between the original data set and each of the two synthetic data sets. All tests were two-sided with a significance level α of 0.05. Table 1 summarizes baseline patient characteristics and outcomes for both synthetic cohorts compared to the original cohort. Both models adequately represented patient features, although some individual variables deviated significantly from the original cohort. It is important to note that with such a large sample size (n=1606 for each cohort), even minuscule differences can reach statistical significance without reflecting a clinically meaningful difference. Interestingly, the variables that deviated from the original distribution differed between the models, indicating that model architecture plays a vital role in sample representation: while CTAB-GAN+ showed significant deviations for both age and sex, NFlow showed significant deviations for AML status. The complete remission rate was similar in the original cohort (70.7%, odds ratio [OR]: 2.41), the CTAB-GAN+ cohort (73.7%, OR: 2.81, p=0.059), and the NFlow cohort (69.1%, OR: 2.24, p=0.356). For event-free survival (EFS), which was not included as a target in hyperparameter tuning, both networks deviated significantly from the original cohort (original: median 7.2 months, HR: 1.36; CTAB-GAN+: median 12.8 months, HR: 0.74, p<0.001; NFlow: median 9.0 months, HR: 0.87, p=0.001). Overall survival (OS) was well represented by NFlow compared to the original cohort, while CTAB-GAN+ showed a significant deviation (original: median 17.5 months, HR: 1.14; CTAB-GAN+: median 19.5 months, HR: 0.88, p<0.001; NFlow: median 16.2 months, HR: 1.00, p=0.055). Both models yielded Kaplan-Meier curves that adequately resembled those of the original cohort (Figure 1).
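The remark above on sample size can be made concrete with a short calculation. The sketch below applies a pooled two-proportion z-test to the same 3-percentage-point gap in remission rate (70.7% vs. 73.7%, taken from the text) at several cohort sizes. The choice of test is an assumption, since the abstract does not state which test was used, so the resulting p-values need not match those reported; the cohort sizes 200 and 5000 are hypothetical. The helper `odds` is included because the bracketed values reported next to the remission rates appear numerically close to p/(1-p), which is an observation rather than a statement from the authors.

```python
import math


def two_proportion_p_value(p1, p2, n1, n2):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # For a standard normal Z, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1 - p)


# Identical difference in rates, evaluated at different cohort sizes.
# 1606 is the cohort size from the text; 200 and 5000 are hypothetical.
p_values = {
    n: two_proportion_p_value(0.707, 0.737, n, n) for n in (200, 1606, 5000)
}

for n, p in p_values.items():
    print(f"n={n:>5} per cohort: p={p:.3f}")

print(f"odds at 70.7% remission: {odds(0.707):.2f}")
```

The effect size is identical in all three rows; only the number of patients changes, which is why a p-value alone says little about clinical relevance in cohorts of this size.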
Using two different generative AI technologies, we demonstrate that synthetic data generation offers an attractive way to circumvent issues in current standards of data collection and sharing. It allows logistical, organizational, and financial burdens, as well as regulatory and ethical concerns, to be bypassed. Ultimately, this enables exploratory research into previously inaccessible data sets and offers the prospect of fully synthetic control arms in prospective clinical trials.