Abstract

Nationwide population-based cohorts provide a new opportunity to build automated risk prediction models based on individuals’ history of health and healthcare, beyond existing risk prediction models. We tested whether machine learning models can predict the future incidence of Alzheimer’s disease (AD) using large-scale administrative health data. From the Korean National Health Insurance Service database between 2002 and 2010, we obtained de-identified health data on elders above 65 years (N = 40,736) containing 4,894 unique clinical features, including ICD-10 codes, medication codes, laboratory values, history of personal and family illness, and socio-demographics. To define incident AD, we considered two operational definitions: “definite AD” with diagnostic codes and dementia medication (n = 614) and “probable AD” with only diagnosis (n = 2,026). We trained and validated random forest, support vector machine, and logistic regression models to predict incident AD in 1, 2, 3, and 4 subsequent years. For predicting future incidence of AD in balanced samples (bootstrapping), the machine learning models showed reasonable performance in 1-year prediction, with AUCs of 0.775 and 0.759 based on the “definite AD” and “probable AD” outcomes, respectively; in 2-year prediction, 0.730 and 0.693; in 3-year, 0.677 and 0.644; in 4-year, 0.725 and 0.683. The results were similar when the entire (unbalanced) samples were used. Important clinical features selected in logistic regression included hemoglobin level, age, and urine protein level. This study may shed light on the utility of data-driven machine learning models based on large-scale administrative health data in AD risk prediction, which may enable better selection of individuals at risk for AD in clinical trials or early detection in clinical settings.
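To make the evaluation terms in the abstract concrete, the following is a minimal sketch of how an AUC can be estimated on balanced bootstrap samples. It is not the authors' pipeline: the abstract does not specify how the balanced samples were drawn, so the resampling scheme here (equal numbers of cases and controls drawn with replacement) is an assumption, the risk scores are synthetic, and the helper names `auc` and `balanced_bootstrap` are hypothetical. Only the case prevalence (614 of 40,736) is taken from the text.

```python
import random


def auc(labels, scores):
    """Rank-based AUC: the probability that a randomly chosen case
    receives a higher score than a randomly chosen control (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case and one control")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def balanced_bootstrap(labels, rng):
    """Indices of one bootstrap sample with equal numbers of cases and controls.
    ASSUMPTION: this is one plausible reading of "balanced samples (bootstrapping)"."""
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    neg_idx = [i for i, y in enumerate(labels) if y == 0]
    k = min(len(pos_idx), len(neg_idx))  # size of the rarer class
    cases = [rng.choice(pos_idx) for _ in range(k)]
    controls = [rng.choice(neg_idx) for _ in range(k)]
    return cases + controls


# Synthetic cohort: prevalence mirrors the "definite AD" definition in the abstract.
rng = random.Random(0)
n_people = 5000
prevalence = 614 / 40736
labels = [1 if rng.random() < prevalence else 0 for _ in range(n_people)]

# Synthetic risk scores (NOT model output): cases are shifted upward by one unit.
scores = [rng.gauss(1.0 if y == 1 else 0.0, 1.0) for y in labels]

# Average the AUC over several balanced bootstrap replicates.
boot_aucs = []
for _ in range(20):
    idx = balanced_bootstrap(labels, rng)
    boot_aucs.append(auc([labels[i] for i in idx], [scores[i] for i in idx]))
mean_auc = sum(boot_aucs) / len(boot_aucs)

print(f"cases in synthetic cohort: {sum(labels)}")
print(f"mean balanced-bootstrap AUC: {mean_auc:.3f}")
print(f"AUC on the full unbalanced sample: {auc(labels, scores):.3f}")
```

Because the AUC depends only on how cases rank relative to controls, it is expected to be similar on balanced and unbalanced samples for a fixed set of scores, which is consistent with the abstract's remark that results were similar on the entire sample.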

Highlights

  • Screening individuals at risk for Alzheimer’s disease (AD) based on medical health records in preclinical stages may lead to early detection of AD pathology and to better therapeutic strategies for delaying the onset of AD[1–3]

  • Classification performance decreased as the prediction period grew longer: using the definite AD definition, AUCs were 0.781 (1 year), 0.739 (2 years), 0.686 (3 years), and 0.662 (4 years); using the probable AD definition, AUCs were 0.730 (1 year), 0.645 (2 years), 0.575 (3 years), and 0.602 (4 years)

  • Despite the limitations inherent to administrative health data, such as the inability to directly ascertain clinical phenotypes, this study demonstrates their potential utility in AD risk prediction when combined with data-driven machine learning


Summary

INTRODUCTION

Screening individuals at risk for Alzheimer’s disease (AD) based on medical health records in preclinical stages may lead to early detection of AD pathology and to better therapeutic strategies for delaying the onset of AD[1–3]. With the advent of digitalization, the amount of such data has increased exponentially[4]. Because it is ubiquitous, cost-effective, and enormous, the digitalized healthcare database may be an invaluable resource for testing scalable predictive models for AD and other diseases alike. We test the extent to which a data-driven machine learning approach harvests salient information from large-scale healthcare data containing thousands of data points on individuals’ health trajectories and makes an individual-specific prediction of AD risk. It is important to use sufficiently large data representative of the population. Using the thorough, longitudinal administrative healthcare data (e.g., insurance claims and health check-ups) within this database, we constructed and validated data-driven machine learning models to predict future incidence of AD. The results were similar when we used the entire, unbalanced samples for model training and evaluation (Supplementary Table 1).

RESULTS
METHODS
Ethical approval
