Systematic Evaluation of Synthetic Panel Data Quality with an Application to Chronic Lymphocytic Leukemia

Dimitris Karletsos, Jackie Vanderpuye-Orgle, Andy Wilson

doi:10.1182/blood-2022-171057

Open Access

Abstract

OBJECTIVES: The dynamic and evolving clinical trial landscape, with its nascent incorporation of real-world data, has the potential to transform healthcare. With these developments come challenges in protecting patient privacy while retaining the information in the data and encouraging transparency and reproducibility of methods. Synthetic data generation (SDG) is an emerging technology that may ameliorate these challenges. SDG promises realistic, representative, and sharable data that retain all the potential learning of the original (parent) data. We demonstrate a flexible framework for generating and evaluating synthetic data from a challenging clinical use case of chronic lymphocytic leukemia (CLL) patients. We demonstrate the potential to generate a synthetic cohort using real-world data on a cohort of patients treated with chimeric antigen receptor T-cell (CAR T) therapy. Many analytes, and their longitudinal progression, have been shown to predict clinical outcomes in CLL. Therefore, when generating synthetic data for this cohort, it is essential to preserve these longitudinal analyte "fingerprints", together with their association with other clinical information, to capture latent disease progression adequately.

METHODS: We used Generative Adversarial Networks (GANs), a type of unsupervised deep learning algorithm, to generate synthetic CLL patients and their latent disease progression over time. Using ICD-9/10 codes, we identified 389 patients within a large tertiary healthcare system (providing care to approximately 5 million patients) who could provide longitudinal values for the synthetic cohort. We simulated synthetic patient data using Electronic Medical Record (EMR) data, including laboratory test values, patient-reported health state utility values (HSUVs), and other baseline characteristics.

RESULTS: Clinical attributes showed a strong relation between analyte trajectories and outcomes. Synthetic data were indistinguishable from the original data both in statistical tests and in the performance of machine learning algorithms that predict disease progression and worsening outcomes. The Wasserstein conditional GAN outperformed the vanilla GAN, the conditional GAN, and the Wasserstein GAN. Synthetic patient data generated by the GAN accurately reflect the means, standard deviations, and correlations of each variable over time, to the extent that a logistic regression cannot distinguish synthetic data from actual data. Moreover, our unsupervised model predicts changes in total HSUVs with the same accuracy as specifically trained supervised models, while additionally capturing the correlation structure of the covariates.

LIMITATIONS and CONCLUSIONS: Many synthetic data generation methods emphasize retaining the relationships among data elements and may therefore preclude certain data anomalies. The real-world data may retain properties associated with the experiment or data generation process and carry them over into the synthetic cohort. The ideal synthetic cohort would support any statistical discovery possible in the parent dataset, and be verifiable against it, while reducing the probability of patient identification to zero. This application of statistical tests to evaluate deep learning algorithms provides a novel perspective on synthetic data generation and lays the basis for establishing best practices for synthetic data quality assessment.
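The two quality checks named in the results (agreement of means and standard deviations, and a logistic regression that tries to tell synthetic rows from real ones) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes each patient record has already been flattened into a fixed-length numeric row, it omits the correlation comparison, and every name in it (make_cohort, summary_gaps, discriminator_accuracy) is hypothetical. The toy Gaussian data stand in for analyte values and are not drawn from the study.

```python
import math
import random
import statistics


def make_cohort(n, shift, seed):
    """Hypothetical toy generator: n rows of two Gaussian 'analyte' features."""
    rng = random.Random(seed)
    return [[rng.gauss(shift, 1.0), rng.gauss(shift, 1.0)] for _ in range(n)]


def summary_gaps(real, synth):
    """Largest absolute gap in per-column means and standard deviations."""
    mean_gap = sd_gap = 0.0
    for j in range(len(real[0])):
        r = [row[j] for row in real]
        s = [row[j] for row in synth]
        mean_gap = max(mean_gap, abs(statistics.fmean(r) - statistics.fmean(s)))
        sd_gap = max(sd_gap, abs(statistics.stdev(r) - statistics.stdev(s)))
    return mean_gap, sd_gap


def _sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def discriminator_accuracy(real, synth, seed=0, epochs=200, lr=0.5):
    """Held-out accuracy of a logistic regression trained to separate
    real rows (label 1) from synthetic rows (label 0).
    Values near 0.5 mean the classifier cannot tell them apart."""
    rng = random.Random(seed)
    data = [(row, 1) for row in real] + [(row, 0) for row in synth]
    rng.shuffle(data)
    cut = int(0.7 * len(data))
    train, test = data[:cut], data[cut:]

    # Standardize features using training-set statistics only.
    d = len(train[0][0])
    mu = [statistics.fmean([x[j] for x, _ in train]) for j in range(d)]
    sd = [statistics.pstdev([x[j] for x, _ in train]) or 1.0 for j in range(d)]

    def scale(x):
        return [(x[j] - mu[j]) / sd[j] for j in range(d)]

    train_scaled = [(scale(x), y) for x, y in train]
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):  # plain batch gradient descent on the log loss
        gw, gb = [0.0] * d, 0.0
        for xs, y in train_scaled:
            err = _sigmoid(b + sum(wj * xj for wj, xj in zip(w, xs))) - y
            gb += err
            for j in range(d):
                gw[j] += err * xs[j]
        w = [wj - lr * g / len(train_scaled) for wj, g in zip(w, gw)]
        b -= lr * gb / len(train_scaled)

    hits = 0
    for x, y in test:
        p = _sigmoid(b + sum(wj * xj for wj, xj in zip(w, scale(x))))
        hits += int((p >= 0.5) == (y == 1))
    return hits / len(test)
```

Calling discriminator_accuracy(real, synth) on a well-matched synthetic cohort should return a value close to 0.5, while a cohort whose distribution has drifted should push it toward 1.0. Under the same assumptions, summary_gaps(real, synth) gives a quick first check before the classifier is run.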
