Abstract

The introduction of Electronic Health Records (EHRs) is rapidly transforming healthcare. An EHR contains a patient's private information and health history in digital form. Because of privacy concerns, EHR data cannot be shared with the Machine Learning (ML) research community, whose work could make the healthcare system smarter and improve the quality of care provided to patients. As a result, synthetic data is used as a substitute when real-world data (such as EHR data) is unavailable. Synthetic data can be shared without revealing any private patient information. This paper focuses on generating synthetic data from a real dataset. As a use case, we selected a real Chronic Kidney Disease (CKD) dataset and prepared three datasets: real, synthetic, and a combination of real and synthetic. To evaluate the quality of the synthetic data, we ran six supervised ML algorithms on these three datasets, with all features and with a reduced feature set, to predict whether a patient has CKD. The algorithms are assessed on the following performance metrics: Confusion Matrix, Accuracy, Recall, Precision, and F1-Score. According to the results, XGBoost outperforms the other algorithms, achieving 100 percent accuracy on all three datasets with full features and 100 percent accuracy on the combined real and synthetic dataset with feature reduction.
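
For readers unfamiliar with the performance metrics named above, the following is a minimal sketch of how they can be derived from a binary confusion matrix. It is not the authors' code: the function names `confusion_counts` and `binary_metrics` are hypothetical, and the labels are toy values invented for illustration (1 = CKD, 0 = not CKD), not data from the paper.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for a binary classification task."""
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts["tp"], counts["fp"], counts["fn"], counts["tn"]


def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels, invented for illustration only (1 = CKD, 0 = not CKD).
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]

metrics = binary_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

In a comparison like the one described, the same computation would be repeated for each classifier on each of the three datasets, so that the scores obtained on synthetic data can be set against those obtained on real data.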
