Comparison of Synthetic Data Generation Techniques for Control Group Survival Data in Oncology Clinical Trials: Simulation Study.

Ippei Akiya, Takuma Ishihara, Keiichi Yamamoto

doi:10.2196/55118

Abstract

Synthetic patient data (SPD) generation for survival analysis in oncology trials holds significant potential for accelerating clinical development. Various machine learning methods, including classification and regression trees (CART), random forest (RF), Bayesian network (BN), and conditional tabular generative adversarial network (CTGAN), have been used for this purpose, but how well they reproduce actual patient survival data remains under investigation. The aim of this study was to determine the most suitable SPD generation method for oncology trials, focusing on progression-free survival (PFS) and overall survival (OS), which are the primary evaluation end points in oncology trials. To this end, we conducted a comparative simulation of 4 generation methods (CART, RF, BN, and CTGAN) and evaluated the performance of each. For each of multiple clinical trial data sets, 1000 synthetic data sets were generated with each method and evaluated on (1) the median survival time (MST) of PFS and OS; (2) the hazard ratio distance (HRD), which indicates the similarity between the actual survival function and a synthetic survival function; and (3) visual analysis of Kaplan-Meier (KM) plots. Each method's ability to mimic the statistical properties of real patient data was thus evaluated from multiple angles. In most simulation cases, CART yielded a high percentage of synthetic data sets whose MST fell within the 95% CI of the MST of the actual data. These percentages ranged from 88.8% to 98.0% for PFS and from 60.8% to 96.1% for OS. In the evaluation of HRD, the values for CART were concentrated at approximately 0.9. For the other methods, no consistent trend was observed for either PFS or OS. CART showed better similarity to the actual data than RF because CART overfit the data, whereas RF, an ensemble learning approach, prevented overfitting.
In SPD generation, the focus should be on statistical properties close to those of the actual data, not on a well-generalized prediction model. Neither BN nor CTGAN could accurately reflect the statistical properties of the actual data, because these methods are not well suited to small data sets. For generating SPD for survival data from small data sets, such as clinical trial data, CART proved more effective than RF, BN, and CTGAN. Additionally, CART-based generation methods could be improved in future work by incorporating feature engineering and other techniques.
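
The MST criterion described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (km_curve, median_survival, pct_within_ci), the toy data, and the CI bounds are all hypothetical. The sketch assumes the 95% CI of the actual MST has already been computed elsewhere, and it takes the median as the smallest event time at which the Kaplan-Meier estimate is at or below 0.5 (some software handles the exact-0.5 case differently).

```python
from fractions import Fraction
from typing import List, Optional, Sequence, Tuple

# Hypothetical helper names; a sketch of the MST-coverage idea only.


def km_curve(times: Sequence[float], events: Sequence[int]) -> List[Tuple[float, Fraction]]:
    """Kaplan-Meier estimate as (event time, survival probability) pairs.

    events[i] is 1 if the event was observed at times[i], 0 if censored.
    Fractions keep the product exact, so the 0.5 comparison is not
    affected by floating-point rounding.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = Fraction(1)
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all subjects that share the same time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= Fraction(at_risk - deaths, at_risk)
            curve.append((t, surv))
        at_risk -= removed
    return curve


def median_survival(times: Sequence[float], events: Sequence[int]) -> Optional[float]:
    """Smallest event time with KM survival <= 0.5, or None if not reached."""
    for t, s in km_curve(times, events):
        if s <= Fraction(1, 2):
            return t
    return None


def pct_within_ci(synthetic_sets, ci_low: float, ci_high: float) -> float:
    """Percentage of synthetic data sets whose MST lies in [ci_low, ci_high]."""
    hits = 0
    for times, events in synthetic_sets:
        mst = median_survival(times, events)
        if mst is not None and ci_low <= mst <= ci_high:
            hits += 1
    return 100.0 * hits / len(synthetic_sets)


# Toy example (made-up numbers, not trial data).
toy_sets = [
    ([1, 2, 3, 4, 5], [1, 1, 1, 1, 1]),
    ([1, 2, 3, 4, 5], [1, 0, 1, 0, 1]),
    ([1, 2, 3], [0, 0, 0]),
    ([2, 4, 6, 8], [1, 1, 1, 1]),
]
print(pct_within_ci(toy_sets, ci_low=2.5, ci_high=4.5))
```

In the study's setting, the list would hold the 1000 synthetic data sets generated by one method for one trial, and the bounds would be the 95% CI of the MST estimated from the actual data.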


Journal: JMIR Medical Informatics
Publication Date: Jun 18, 2024
License type: CC BY

