OBJECTIVES The dynamic and evolving clinical trial landscape, with its nascent incorporation of real-world data, has the potential to transform healthcare. These developments bring challenges in protecting patient privacy while preserving the information content of the data and encouraging transparency and reproducibility of methods. Synthetic data generation (SDG) is an emerging technology that may ameliorate these challenges. The promise of SDG is to provide realistic, representative, and sharable data that retain everything that could be learned from the original (parent) data. We demonstrate a flexible framework for generating and evaluating synthetic data for a challenging clinical use case: patients with chronic lymphocytic leukemia (CLL). Using real-world data, we show that a synthetic cohort can be generated for patients treated with chimeric antigen receptor T-cell (CAR T) therapy. Many analytes, and their longitudinal progression, have been shown to predict clinical outcomes in CLL. Therefore, when generating synthetic data for this cohort, it is essential to preserve these longitudinal analyte 'fingerprints' and their association with other clinical information in order to capture latent disease progression adequately.
METHODS We used Generative Adversarial Networks (GANs), a type of unsupervised deep learning algorithm, to generate synthetic CLL patients and their latent disease progression over time. Using ICD-9/10 codes, we identified 389 patients with longitudinal values for the synthetic cohort within a large tertiary healthcare system (providing care to approximately 5 million patients). We simulated synthetic patient data from electronic medical record (EMR) data, including laboratory test values, patient-reported health state utility values (HSUVs), and other baseline characteristics.
RESULTS Clinical attributes showed a strong relation between analyte trajectories and outcomes. Synthetic data were indistinguishable from the original data both in statistical tests and in the performance of machine learning algorithms predicting disease progression and worsening outcomes. The Wasserstein conditional GAN outperformed the vanilla GAN, the conditional GAN, and the Wasserstein GAN. Synthetic patient data generated by the GAN accurately reflect the means, standard deviations, and correlations of each variable over time, to the extent that a logistic regression cannot distinguish synthetic data from actual data. Moreover, our unsupervised model predicts changes in total HSUVs with the same accuracy as specifically trained supervised models, while additionally capturing the correlation structure of the covariates.
LIMITATIONS AND CONCLUSIONS Many synthetic data generation methods emphasize retaining the relationships among data elements and may therefore preclude certain data anomalies. The real-world data may retain properties associated with the experiment or data generation process and carry them over into the synthetic cohort. The ideal synthetic cohort would support any statistical discovery possible in the parent dataset, and be verifiable against it, while reducing the probability of patient identification to zero. This application of statistical tests to evaluate deep learning algorithms provides a novel perspective on synthetic data generation and lays the basis for establishing best practices for synthetic data quality assessment.
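The logistic-regression check mentioned in the results can be illustrated with a minimal sketch. The abstract does not describe the authors' actual setup, so everything here is an assumption made for illustration: the function name `discriminator_accuracy`, the even train/test split, the plain gradient-descent fit, and the toy Gaussian data standing in for patient records. The idea is to train a classifier to separate real rows from synthetic rows; held-out accuracy near 0.5 suggests the two sets are hard to tell apart, while accuracy well above 0.5 flags a detectable difference.

```python
import math
import random


def _sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def discriminator_accuracy(real, synthetic, epochs=200, lr=0.1, seed=0):
    """Held-out accuracy of a logistic regression trained to separate
    real rows (label 0) from synthetic rows (label 1).

    Hypothetical helper for illustration; not the authors' implementation.
    """
    rng = random.Random(seed)
    rows = [(x, 0) for x in real] + [(x, 1) for x in synthetic]
    rng.shuffle(rows)
    split = len(rows) // 2
    train, test = rows[:split], rows[split:]

    d = len(train[0][0])
    # Standardize features using training statistics only.
    means = [sum(x[j] for x, _ in train) / len(train) for j in range(d)]
    stds = [
        math.sqrt(sum((x[j] - means[j]) ** 2 for x, _ in train) / len(train)) or 1.0
        for j in range(d)
    ]

    def scale(x):
        return [(x[j] - means[j]) / stds[j] for j in range(d)]

    X = [scale(x) for x, _ in train]
    y = [label for _, label in train]

    # Full-batch gradient descent on the logistic loss.
    w = [0.0] * d
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * d
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = _sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)

    # Score on the held-out half.
    correct = 0
    for x, label in test:
        z = b + sum(wj * xj for wj, xj in zip(w, scale(x)))
        correct += int((z > 0) == (label == 1))
    return correct / len(test)


# Toy data (assumed, not patient data): three Gaussian features per row.
_rng = random.Random(42)


def sample(n, shift=0.0):
    return [[_rng.gauss(shift, 1.0), _rng.gauss(0.0, 1.0), _rng.gauss(0.0, 1.0)]
            for _ in range(n)]


real = sample(300)
good_synth = sample(300)             # same distribution as "real"
bad_synth = sample(300, shift=3.0)   # first feature visibly shifted

acc_good = discriminator_accuracy(real, good_synth)
acc_bad = discriminator_accuracy(real, bad_synth)
print(f"matched distribution: {acc_good:.2f}, shifted distribution: {acc_bad:.2f}")
```

In this sketch the classifier is scored on rows it never saw during fitting, so a near-chance score reflects genuine similarity rather than underfitting on the training set. A check of this kind only probes differences a linear model can detect; it would complement, not replace, the comparisons of means, standard deviations, and correlations described above.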