Clinical Implication of Machine Learning in Predicting the Occurrence of Cardiovascular Disease Using Big Data (Nationwide Cohort Data in Korea)

Gihun Joo, Hyeonseung Im, Yeongjin Song, Junbeom Park

doi:10.1109/access.2020.3015757

Abstract

Machine learning (ML) and large-scale big data are key factors in developing an accurate prediction model for cardiovascular disease (CVD). Although CVD risk often depends on race and ethnicity, most previous studies considered only US or European populations when predicting CVD risk. In this work, to complement previous research, we analyzed the Korean National Health Insurance Service-National Health Sample Cohort (KNHSC) data and studied the characteristics of ML and big data for predicting CVD risk. More specifically, we assessed the effectiveness of various ML methods in predicting the 2-year and 10-year risk of CVDs such as atrial fibrillation, coronary artery disease, heart failure, and stroke. To develop prediction models, we considered the usual medical examination data, questionnaire survey results, comorbidities, and past medication information available in the KNHSC data. We developed ML-based prediction models using logistic regression, deep neural networks, random forests, and LightGBM, and validated them using metrics such as receiver operating characteristic curves, precision-recall curves, sensitivity, specificity, and F1 score. Experimental results showed that all ML models outperformed the baseline method derived from the ACC/AHA guidelines for estimating the 10-year CVD risk, demonstrating the usefulness of ML methods. In addition, the ML models achieved prediction accuracy comparable to one another whether or not past medication information was included as a feature. Because physicians' use of medications carries important information about the occurrence of disease, including it as a feature gave all prediction models slightly higher prediction accuracy.
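The validation metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the function names `binary_metrics` and `roc_auc`, the 0.5 decision threshold, and the toy labels and scores are all assumptions made for illustration. It computes sensitivity, specificity, and F1 score from confusion-matrix counts, and the area under the ROC curve as the fraction of positive/negative pairs that the model ranks correctly.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_score: Sequence[float],
                   threshold: float = 0.5) -> Dict[str, float]:
    """Sensitivity, specificity, precision, and F1 at a fixed threshold.

    The threshold of 0.5 is an illustrative assumption, not a value
    taken from the paper.
    """
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, y_score):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1

    # Guard each ratio against an empty denominator.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # true negative rate
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "f1": f1}


def roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise ranking (ties count as 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Made-up labels (1 = disease occurred) and predicted risks.
    labels = [1, 0, 1, 1, 0, 0, 1, 0]
    risks = [0.9, 0.2, 0.6, 0.4, 0.7, 0.1, 0.8, 0.3]
    print(binary_metrics(labels, risks))
    print(roc_auc(labels, risks))
```

The pairwise AUC is quadratic in the number of samples, so it suits only small illustrations; for cohort-scale data a sort-based implementation from an established library would be the more practical choice.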

Highlights

Representative cardiovascular diseases (CVDs) include myocardial infarction, atrial fibrillation, heart failure, and stroke.
To complement previous research, we developed various machine learning (ML) models based on logistic regression, deep neural networks, random forests [13], and LightGBM [14] to predict the risk of CVD using systematically organized large-scale nationwide health examination data in Korea [15], [16].
In this study, we developed ML-based prediction models for CVDs such as atrial fibrillation (AF), coronary artery disease (CAD), heart failure (HF), and stroke by analyzing the Medical Check-up Cohort DB ver 1.0 provided by the Korean National Health Insurance Service [15], [16] (NHIS-20162-263).

Summary

Introduction

Representative cardiovascular diseases (CVDs) include myocardial infarction, atrial fibrillation, heart failure, and stroke. The occurrence of CVD is affected by various risk factors such as race, ethnicity, age, sex, weight, height, body mass index, and blood test results including kidney function, liver function, and cholesterol levels [1]–[4]. These factors are often intertwined and affect the development of various diseases in complicated ways. Prediction models based on conventional statistical methods often cannot reflect all the complex causal relationships between the risk factors [5], [6]. The recent standardization of medical big data and the systematization of national health examination data have made it possible to analyze previously unknown risk factors that may have a statistically significant association with the occurrence of disease, which may in turn allow us to trace back.

Methods

Results

Discussion

Conclusion