Abstract

There is a growing demand for the uptake of modern artificial intelligence technologies within healthcare systems. Many of these technologies exploit historical patient health data to build powerful predictive models that can be used to improve diagnosis and understanding of disease. However, many issues concerning patient privacy need to be addressed before this data can be better harnessed by all sectors. One approach that could circumvent these privacy issues is the creation of realistic synthetic data sets that capture as many of the complexities of the original data set as possible (distributions, non-linear relationships, and noise) but that do not include any real patient data. While previous research has explored models for generating synthetic data sets, here we explore the integration of resampling, probabilistic graphical modelling, latent variable identification, and outlier analysis for producing realistic synthetic data based on UK primary care patient data. In particular, we focus on handling missingness, complex interactions between variables, and the resulting sensitivity analysis statistics from machine learning classifiers, while quantifying the risks of patient re-identification from synthetic datapoints. We show that, by integrating outlier analysis with graphical modelling and resampling, we can produce synthetic data sets that are not significantly different from the original ground truth data in terms of feature distributions, feature dependencies, and sensitivity analysis statistics when inferring machine learning classifiers. Moreover, the risk of generating synthetic data that is identical or very similar to real patients is shown to be low.

Highlights

  • It is increasingly evident that the use of historical data within health systems can offer huge rewards in terms of increased accuracy, timely diagnoses, the discovery of new knowledge about disease and its progression, and the ability to offer a more personalised prognosis and care pathway for patients[1]

  • The paper is broken down into three main sections: first, we discuss some of the key issues concerning the generation and use of synthetic data and introduce a method based on probabilistic graphical models; second, we explore a case study using primary care data from the Clinical Practice Research Datalink (CPRD) in the UK

  • This is because we want to ensure that any biases that occur in the ground truth data are [...] synthetic data can be used while ensuring patient privacy


Summary

Introduction

It is increasingly evident that the use of historical data within health systems can offer huge rewards in terms of increased accuracy, timely diagnoses, the discovery of new knowledge about disease and its progression, and the ability to offer a more personalised prognosis and care pathway for patients[1]. Many issues concerning patient privacy have been highlighted since the introduction of the General Data Protection Regulation[3]. These include protection from the identification of an individual's data within large data samples[4] and the right to explanation for any decision that is made by an automated system[5]. One possible solution to this problem is the use of synthetic data as an alternative to assist in the rapid development and validation of new tools. Such data must capture all of the correct (potentially non-linear and multivariate) dependencies and distributions that are apparent in the real data sets, while preserving patient privacy and avoiding the risks of individual identification.
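The privacy requirement can be made concrete with a simple check: how many synthetic records are identical to, or only a few feature values away from, some real record? The sketch below is only an illustration of that idea, not the method used in the paper. It assumes every feature has been discretised, uses Hamming distance as the similarity measure, and runs on invented toy records; the function names (`hamming`, `nearest_real_distance`, `close_match_rate`) are hypothetical.

```python
from typing import Sequence, Tuple

# Assumption: each patient is a tuple of discretised feature values.
Record = Tuple[int, ...]


def hamming(a: Record, b: Record) -> int:
    """Number of features on which two records disagree."""
    return sum(x != y for x, y in zip(a, b))


def nearest_real_distance(synthetic: Record, real: Sequence[Record]) -> int:
    """Distance from one synthetic record to its closest real record."""
    return min(hamming(synthetic, r) for r in real)


def close_match_rate(
    synthetic: Sequence[Record],
    real: Sequence[Record],
    max_distance: int = 0,
) -> float:
    """Fraction of synthetic records within max_distance of some real record."""
    hits = sum(nearest_real_distance(s, real) <= max_distance for s in synthetic)
    return hits / len(synthetic)


# Invented toy data, for illustration only (not CPRD data).
real = [(1, 0, 2, 1), (0, 0, 1, 1), (1, 1, 2, 0)]
synthetic = [(1, 0, 2, 1), (0, 1, 0, 0), (1, 1, 2, 1), (0, 0, 0, 0)]

exact = close_match_rate(synthetic, real, max_distance=0)  # exact copies
near = close_match_rate(synthetic, real, max_distance=1)   # copies or one value off

print(f"exact matches: {exact:.2f}, near matches: {near:.2f}")
```

A low rate at small distances suggests that few synthetic records reproduce a real patient. How large a distance counts as "very similar" is a judgment call that depends on the number of features and how rare each combination of values is.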
