Abstract
Background and Aims Synthetic data can be an effective supplement or alternative to real data for training machine learning models. It may also be used to evaluate new tools, develop educational curricula, or remove undesirable biases from datasets. We aimed to evaluate four synthetic data generation methods applied to data from a randomized clinical trial in hypertension.
Method The Systolic Blood Pressure Intervention Trial (SPRINT) showed that intensive blood pressure (BP) control to a systolic BP (SBP) target of <120 mm Hg yields significant cardiovascular benefits in high-risk patients with hypertension compared with routine BP control to <140 mm Hg. The Synthetic Data Vault (SDV) is an ecosystem of synthetic data generation libraries that lets users generate new synthetic data with the same format and statistical properties as the original dataset. SDV supports multiple data types, including date-time, discrete ordinal, categorical, and numerical. SPRINT data were pre-processed into a single table of 140,000 patient visits with baseline variables (age, sex, race, aspirin use, and estimated glomerular filtration rate (eGFR)) and visit-level variables (systolic and diastolic blood pressure, heart rate, and total number of antihypertensive medications at the end of the visit). Using the SDV library for Python, we fitted four generative models to create synthetic SPRINT data: (1) Gaussian copula, (2) Conditional Tabular Generative Adversarial Network (CTGAN), (3) CopulaGAN, and (4) Tabular Variational Autoencoder (TVAE). We evaluated the results with the SDMetrics library, which assesses the shapes of the columns (marginal distributions), the pairwise trends between columns (correlations), the reproduction of mathematical properties of the original data, and new row synthesis. Finally, we computed an overall quality score that combines the marginal distribution and correlation scores, ranging from 0 (lowest quality) to 1 (highest quality) and reported below as a percentage.
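To make the two scored properties in the Method concrete, the following is a minimal plain-Python sketch of the kind of statistics they summarize. It is not the SDMetrics implementation and not the authors' code. It assumes that column shapes are scored with a Kolmogorov-Smirnov complement for numerical columns and a total variation complement for categorical columns, and that column pair trends for numerical pairs are scored from the difference in Pearson correlation, which is how these metrics are commonly described. All function names and the toy values are hypothetical and are not SPRINT data.

```python
from bisect import bisect_right
from math import sqrt


def ks_complement(real, synth):
    """1 minus the two-sample Kolmogorov-Smirnov statistic (numerical column)."""
    real_sorted, synth_sorted = sorted(real), sorted(synth)
    # Evaluate both empirical CDFs at every observed value.
    points = sorted(set(real_sorted) | set(synth_sorted))
    ks_stat = max(
        abs(bisect_right(real_sorted, x) / len(real_sorted)
            - bisect_right(synth_sorted, x) / len(synth_sorted))
        for x in points
    )
    return 1.0 - ks_stat


def tv_complement(real, synth):
    """1 minus the total variation distance between category frequencies."""
    categories = set(real) | set(synth)
    tvd = 0.5 * sum(
        abs(real.count(c) / len(real) - synth.count(c) / len(synth))
        for c in categories
    )
    return 1.0 - tvd


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_similarity(real_x, real_y, synth_x, synth_y):
    """1 minus half the absolute difference in Pearson correlation."""
    return 1.0 - abs(pearson(real_x, real_y) - pearson(synth_x, synth_y)) / 2.0


# Invented toy values for illustration only (NOT SPRINT data).
real_sbp, synth_sbp = [118, 126, 134, 142], [120, 125, 136, 150]
real_dbp, synth_dbp = [70, 74, 80, 85], [72, 73, 82, 88]
real_sex, synth_sex = ["F", "M", "M", "F"], ["F", "M", "M", "M"]

print("SBP column shape:", ks_complement(real_sbp, synth_sbp))
print("Sex column shape:", tv_complement(real_sex, synth_sex))
print("SBP/DBP pair trend:",
      correlation_similarity(real_sbp, real_dbp, synth_sbp, synth_dbp))
```

Each function returns a value between 0 and 1, where 1 means the synthetic column (or column pair) is statistically indistinguishable from the real one under that measure.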
Results Two hundred thousand synthetic patient visits were created with each method. Overall quality scores, from highest to lowest, were 90.67% for Gaussian copula, 86.77% for TVAE, 81.03% for CTGAN, and 79.7% for CopulaGAN. The column shape score, which represents the marginal distributions, was highest for Gaussian copula (94.54%), followed by TVAE (88.44%), CTGAN (82.35%), and CopulaGAN (80.27%). The column pair trend score, which corresponds to correlations, was highest for Gaussian copula (86.8%), followed by TVAE (85.1%), CTGAN (79.72%), and CopulaGAN (79.12%).
Conclusion Gaussian copula produced the highest-scoring synthetic SPRINT data on marginal distributions, correlations, and overall score. The Synthetic Data Vault is a feasible collection of methods for generating synthetic clinical trial data for training future machine learning and AI models.
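The overall scores reported in the Results can be cross-checked against the two property scores. The short sketch below assumes that the overall quality score is the simple average of the column shape and column pair trend scores; the abstract only calls it an amalgamation, so the averaging rule is an assumption here. The percentages are copied from the Results, and the helper name is hypothetical.

```python
# Scores (percent) as reported in the Results.
reported = {
    "Gaussian copula": {"column_shapes": 94.54, "column_pair_trends": 86.80, "overall": 90.67},
    "TVAE":            {"column_shapes": 88.44, "column_pair_trends": 85.10, "overall": 86.77},
    "CTGAN":           {"column_shapes": 82.35, "column_pair_trends": 79.72, "overall": 81.03},
    "CopulaGAN":       {"column_shapes": 80.27, "column_pair_trends": 79.12, "overall": 79.70},
}


def overall_from_properties(column_shapes, column_pair_trends):
    """Assumed rule: overall score is the mean of the two property scores."""
    return (column_shapes + column_pair_trends) / 2


for model, scores in reported.items():
    mean = overall_from_properties(scores["column_shapes"], scores["column_pair_trends"])
    gap = abs(mean - scores["overall"])
    print(f"{model}: mean of properties = {mean:.3f}, "
          f"reported overall = {scores['overall']:.2f}, gap = {gap:.3f}")
```

Under this assumption the averaged values agree with the reported overall scores to within rounding of the published figures for all four models.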