Combining machine learning and conventional statistical approaches for risk factor discovery in a large cohort study

Iqbal Madakkatel, Mark D Mcdonnell, Ang Zhou, Elina Hyppönen

doi:10.1038/s41598-021-02476-9

Open Access

https://doi.org/10.1038/s41598-021-02476-9

Abstract

We present a simple and efficient hypothesis-free machine learning pipeline for risk factor discovery that accounts for non-linearity and interaction in large biomedical databases with minimal variable pre-processing. In this study, mortality models were built using gradient boosting decision trees (GBDT), and important predictors were identified using a Shapley values-based feature attribution method, SHAP values. Cox models controlled for false discovery rate were used for confounder adjustment, interpretability, and further validation. The pipeline was tested using information from 502,506 UK Biobank participants, aged 37–73 years at recruitment and followed over seven years for mortality registrations. From the 11,639 predictors included in GBDT, 193 potential risk factors had SHAP values ≥ 0.05, passed the correlation test, and were selected for further modelling. Of the summed variable importance, 60% was directly health-related, and baseline characteristics, sociodemographics, and lifestyle factors each contributed about 10%. Cox models adjusted for baseline characteristics showed evidence for an association with mortality for 166 of the 193 predictors. These included mostly well-known risk factors (e.g., age, sex, ethnicity, education, material deprivation, smoking, physical activity, self-rated health, BMI, and many disease outcomes). For 19 predictors we saw evidence for an association in the unadjusted but not the adjusted analyses, suggesting bias by confounding. Our GBDT-SHAP pipeline was able to identify relevant predictors ‘hidden’ within thousands of variables, providing an efficient and pragmatic solution for the first stage of hypothesis-free risk factor identification.
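
The attribution and selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: a real pipeline would fit a GBDT and use a tree-specific SHAP algorithm, whereas this example computes exact Shapley values by enumerating feature subsets for a tiny hypothetical model. The names `toy_model`, `BASELINE`, `explain`, and `select_features` are invented for illustration, and reading "SHAP values ≥ 0.05" as a threshold on the mean absolute attribution per feature is an assumption.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values for a value function defined on feature subsets."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Shapley weight |S|! (n - |S| - 1)! / n! for subsets S of this size
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i when added to subset S
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


def toy_model(x):
    # Hypothetical stand-in for a fitted GBDT: one main effect, one
    # interaction, and a nearly irrelevant fourth feature.
    return 2.0 * x[0] + x[1] * x[2] + 0.01 * x[3]


# Assumed reference values used when a feature is treated as "absent".
BASELINE = (0.0, 0.0, 0.0, 0.0)


def explain(x):
    """Shapley attribution of toy_model(x) relative to BASELINE."""
    def value_fn(subset):
        masked = [x[j] if j in subset else BASELINE[j] for j in range(len(x))]
        return toy_model(masked)
    return shapley_values(value_fn, len(x))


def select_features(rows, threshold=0.05):
    """Indices of features whose mean absolute attribution is >= threshold."""
    attributions = [explain(row) for row in rows]
    n = len(rows[0])
    mean_abs = [sum(abs(a[j]) for a in attributions) / len(rows)
                for j in range(n)]
    return [j for j in range(n) if mean_abs[j] >= threshold]


rows = [(1.0, 1.0, 1.0, 1.0), (0.0, 1.0, 1.0, 1.0), (1.0, 0.0, 1.0, 0.0)]
print(explain(rows[0]))
print(select_features(rows))
```

In this toy setting the interaction term is split equally between the two features involved, which is how a Shapley-based attribution can surface predictors that only matter in combination. The fourth feature falls below the threshold and is dropped. Exhaustive enumeration scales exponentially with the number of features, so it is only practical for a handful of them.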

Highlights

We present a simple and efficient hypothesis-free machine learning pipeline for risk factor discovery that accounts for non-linearity and interaction in large biomedical databases with minimal variable pre-processing.
Machine learning (ML), “the study of computer algorithms that allow computer programs to automatically improve through experience”[1], provides attractive solutions for many of the challenges of analysing such databases, and ML methods have been found to be effective in developing predictive models based on large sets of variables.
We use penalized (LASSO) logistic regression as an alternative baseline approach and include comparisons with another feature selection method (XGBoost[18], using five different built-in ways of calculating feature importance, as done by other studies[19–22]). As we describe in this paper, our data suggest that the proposed gradient boosting decision trees (GBDT)-SHAP pipeline has certain advantages over them.

Summary

Introduction

We present a simple and efficient hypothesis-free machine learning pipeline for risk factor discovery that accounts for non-linearity and interaction in large biomedical databases with minimal variable pre-processing. Cox models adjusted for baseline characteristics showed evidence for an association with mortality for 166 of the 193 predictors. These included mostly well-known risk factors (e.g., age, sex, ethnicity, education, material deprivation, smoking, physical activity, self-rated health, BMI, and many disease outcomes). Cohort studies and biobanks available for medical research are growing, both in the number of individuals included and the density of information available for the participants. These large databases hold enormous potential for innovation and provide exciting prospects for hypothesis-free risk factor discovery. We use penalized (LASSO) logistic regression as an alternative baseline approach and include comparisons with another feature selection method (XGBoost[18], using five different built-in ways of calculating feature importance, as done by other studies[19–22]). As we describe in this paper, our data suggest that the proposed GBDT-SHAP pipeline has certain advantages over them.
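
The abstract states that the Cox models were controlled for false discovery rate, but this page does not say which procedure was used. As a hedged illustration only, the sketch below assumes the Benjamini-Hochberg step-up procedure, a common choice, and applies it to a list of made-up p-values such as one might obtain from per-predictor Cox models. The function name `benjamini_hochberg` and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return booleans marking which hypotheses are rejected at FDR level alpha.

    Assumes the Benjamini-Hochberg step-up procedure; the source does not
    confirm which FDR method the authors applied.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * alpha
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Reject every hypothesis ranked at or below max_k
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, one per candidate predictor
example_p = [0.2, 0.001, 0.6, 0.025, 0.028]
print(benjamini_hochberg(example_p))
```

Note the step-up behaviour in the example: the p-value 0.025 exceeds its own rank threshold (0.02), but it is still rejected because a larger p-value (0.028) passes at the next rank.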

Journal: Scientific Reports
Publication Date: Nov 26, 2021
Citations: 22
License type: open-access
