Abstract

Background: Machine learning models have been used to create accurate prediction models for dementia. However, many suffer from overfitting, and external validation often results in decreased performance. Pooling data from multiple sources for model training can improve the generalizability of prediction models. Here we present a prediction model for dementia developed on pooled data from the Dementia Risk Prediction Pooling (DRPP) Consortium.

Method: Data from 11 longitudinal disease cohorts within the DRPP, covering 25 relevant risk factors, were collected and harmonized at baseline and follow-up exams. An ensemble tree-based algorithm, LightGBM, was used to create two prediction models for dementia at or before 10 years. The first model contains all variables in the dataset; the second, clinical model excludes the Mini-Mental State Examination (MMSE) score and APOE genotype. 5-fold cross-validation was repeated 1000 times to tune the model hyperparameters (number of leaves, tree depth, learning rate) for the greatest area under the curve (AUC). Feature importance was analyzed via individual feature information gain. Analysis was performed in R 4.1.2.

Results: Among the 55,614 participants from 11 cohorts included in this analysis (Table 1), the first model, with all variables, had a cross-validation AUC of 0.762 (CI: 0.757-0.767). Counterintuitively, the second model, without the MMSE and APOE variables, yielded an AUC of 0.804 (CI: 0.799-0.809), which may indicate overfitting in the first model. Feature importance analysis shows that age is the most important variable in both models. APOE, MMSE, fasting glucose, and any physical activity are the next most important predictors in the full model (Figure 1). In the second model, fasting glucose, any physical activity, gender, and A1c levels were the next most important predictors (Figure 2).

Conclusion: By pooling data sources, we can train machine learning models for dementia risk prediction on more diverse data. Further work is needed to compare the performance of these models with models trained on single data sources via external validation. A pooled dataset also offers an opportunity to understand how model performance changes given shifts in the underlying population.
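As an illustration of the workflow described in Method, the minimal sketch below shows one LightGBM cross-validation run and a gain-based feature-importance query using the lightgbm R package. The data frame `df`, the outcome column `dementia_10y`, and all hyperparameter values are assumptions for illustration, not the consortium's actual variable names or tuned settings; the abstract's procedure additionally repeats the cross-validation 1000 times across hyperparameter candidates.

```r
library(lightgbm)

# Hypothetical harmonized dataset: `df` holds the baseline risk factors
# plus a binary outcome `dementia_10y` (dementia at or before 10 years).
x <- as.matrix(df[, setdiff(names(df), "dementia_10y")])
y <- df$dementia_10y

dtrain <- lgb.Dataset(data = x, label = y)

# Hyperparameters of the kind tuned in the abstract (number of leaves,
# tree depth, learning rate); these values are placeholders, not the
# settings selected by the repeated cross-validation.
params <- list(
  objective = "binary",
  metric = "auc",
  num_leaves = 31,
  max_depth = 6,
  learning_rate = 0.05
)

# One 5-fold cross-validation run; the tuning loop in the abstract would
# repeat this over candidate hyperparameters and keep the best-AUC setting.
cv <- lgb.cv(params = params, data = dtrain, nrounds = 200, nfold = 5)

# Refit on the full data with the chosen settings, then rank features by
# information gain, as in the feature-importance analysis.
model <- lgb.train(params = params, data = dtrain, nrounds = 200)
lgb.importance(model)
```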
