Abstract

Background

Machine learning (ML) has made a significant impact in medicine and cancer research; however, its impact in these areas has been undeniably slower and more limited than in other application domains. A major reason for this has been the lack of availability of patient data to the broader ML research community, in large part due to patient privacy protection concerns. High-quality, realistic, synthetic datasets can be leveraged to accelerate methodological developments in medicine. By and large, medical data is high dimensional and often categorical. These characteristics pose multiple modeling challenges.

Methods

In this paper, we evaluate three classes of synthetic data generation approaches: probabilistic models, classification-based imputation models, and generative adversarial neural networks. Metrics for evaluating the quality of the generated synthetic datasets are presented and discussed.

Results

While the results and discussions are broadly applicable to medical data, for demonstration purposes we generate synthetic datasets for cancer based on the publicly available cancer registry data from the Surveillance, Epidemiology, and End Results (SEER) program. Specifically, our cohort consists of breast, respiratory, and non-solid cancer cases diagnosed between 2010 and 2015, which includes over 360,000 individual cases.

Conclusions

We discuss the trade-offs of the different methods and metrics, providing guidance on considerations for the generation and usage of medical synthetic data.

Highlights

  • Machine learning (ML) has made a significant impact in medicine and cancer research; however, its impact in these areas has been undeniably slower and more limited than in other application domains

  • On the small-set, Table 5 shows that many methods succeeded in capturing the statistical dependence among the variables, notably Mixture of Product of Multinomials (MPoM), Multivariate Imputation by Chained Equations with Logistic Regression (MICE-LR), MICE-LR-DESC, and MICE with a Decision Tree classifier (MICE-DT)

  • The results showed that Bayesian Networks, Mixture of Product of Multinomials (MPoM), and categorical latent Gaussian process (CLGP) were capable of capturing relationships among variables, according to the data utility metrics used for comparison
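
To make the probabilistic-model class mentioned above more concrete, the following is a minimal sketch of how records could be sampled from a mixture of products of multinomials once its parameters are known. It is an illustration under stated assumptions, not the paper's implementation: the function name `sample_mpom`, the two toy variables (cancer site and stage), and all probabilities are hypothetical, and fitting the parameters to real registry data is left out entirely.

```python
import random


def sample_mpom(weights, thetas, n, seed=0):
    """Draw n synthetic categorical records from a mixture of
    products of multinomials (hypothetical helper, for illustration).

    weights: mixture weights, one per latent component.
    thetas:  thetas[k][j] maps each category of variable j to its
             probability under component k.
    """
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        # Step 1: pick a latent component according to the mixture weights.
        k = rng.choices(range(len(weights)), weights=weights)[0]
        # Step 2: given the component, sample every variable independently.
        record = []
        for dist in thetas[k]:
            categories = list(dist.keys())
            probs = list(dist.values())
            record.append(rng.choices(categories, weights=probs)[0])
        records.append(tuple(record))
    return records


# Toy parameters (assumed values, not estimated from SEER data).
weights = [0.6, 0.4]
thetas = [
    [{"breast": 0.9, "lung": 0.1}, {"I": 0.7, "IV": 0.3}],
    [{"breast": 0.2, "lung": 0.8}, {"I": 0.2, "IV": 0.8}],
]
synthetic = sample_mpom(weights, thetas, n=1000)
```

Within a single component the variables are independent, so any dependence between variables in the sampled data arises only from mixing over the latent component. That is the mechanism by which a model of this form can represent relationships among categorical variables.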


Summary

Introduction

Machine learning (ML) has made a significant impact in medicine and cancer research; however, its impact in these areas has been undeniably slower and more limited than in other application domains. Large volumes and many types of patient data are being electronically collected by healthcare providers, governments, and private industry. While such datasets are potentially highly valuable resources for scientists, they are generally not accessible to the broader research community due to patient privacy concerns. Even when it is possible for a researcher to gain access to such data, ensuring proper data usage and protection is a lengthy process with strict legal requirements. This can severely delay the pace of research and its translational benefits to patient care. With current approaches, it remains extremely difficult to guarantee that individual patients cannot be re-identified.
