Abstract
Background
While machine learning (ML) algorithms may predict cardiovascular outcomes more accurately than statistical models, their result is usually not representable by a transparent formula. Hence, it is often unclear how specific values of predictors lead to the predictions. We aimed to demonstrate with graphical tools how predictor-risk relations in cardiovascular risk prediction models fitted by ML algorithms and by statistical approaches may differ, and how sample size affects the stability of the estimated relations.

Methods
We reanalyzed data from a large registry of 1.5 million participants in a national health screening program. Three data analysts developed analytical strategies to predict cardiovascular events within 1 year of health screening, each following their preferred modeling approach. This was done on the full data set and on gradually reduced sample sizes. Predictor-risk relations were visualized with partial dependence and individual conditional expectation plots.

Results
Comparing the modeling algorithms, we found some similarities between these visualizations but also occasional divergence. The smaller the sample size, the more the predictor-risk relation depended on the modeling algorithm used, and sampling variability also played an increasing role. Predictive performance was similar when the models were derived on the full data set, whereas smaller sample sizes favored simpler models.

Conclusion
Predictor-risk relations from ML models may differ from those obtained by statistical models, even with large sample sizes. Hence, predictors may assume different roles in risk prediction models. As long as the sample size is sufficient, predictive accuracy is not greatly affected by the choice of algorithm.
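To make the visualization step in the Methods concrete: a partial dependence plot averages a model's predictions over the data while one predictor is swept through its range, and individual conditional expectation (ICE) curves show the same sweep for each observation separately. The sketch below is a minimal illustration with scikit-learn on synthetic data; the model, predictor names, and outcome are assumptions for demonstration only, not the study's registry data or the analysts' actual models.

```python
# Minimal sketch: partial dependence (PD) and ICE curves for one
# predictor of a fitted classifier. All data here are synthetic
# stand-ins, not the registry data analyzed in the paper.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import PartialDependenceDisplay

rng = np.random.default_rng(0)
n = 2000
# Hypothetical predictors loosely mirroring the paper's set:
# age, total cholesterol, systolic blood pressure
X = np.column_stack([
    rng.integers(35, 80, n).astype(float),
    rng.normal(200, 40, n),
    rng.normal(130, 20, n),
])
# Synthetic 1-year event indicator with a weak dependence on age and BP
logit = -6.0 + 0.04 * X[:, 0] + 0.02 * X[:, 2]
y = rng.random(n) < 1.0 / (1.0 + np.exp(-logit))

model = GradientBoostingClassifier(random_state=0).fit(X, y)

# kind="both" overlays the per-observation ICE curves on the averaged
# partial dependence curve; feature index 0 is "age" in this sketch
PartialDependenceDisplay.from_estimator(model, X, features=[0], kind="both")
plt.show()
```

Divergence between the ICE curves and the averaged PD curve is precisely what such plots are designed to expose: it indicates that the predictor's effect varies across individuals rather than being uniform.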
Highlights
While machine learning (ML) algorithms may predict cardiovascular outcomes more accurately than statistical models, their result is usually not representable by a transparent formula.
Widely used cardiovascular disease (CVD) risk prediction models, such as the Framingham 2008 CVD risk model, were statistically estimated by fitting a Cox model with a relatively small number of coefficients [1].
Predictors: Similar to the Framingham 2008 CVD risk model, we considered sex, age, total cholesterol, high-density lipoprotein (HDL) cholesterol, systolic blood pressure (BP, mmHg), antihypertensive drug intake, diabetes, and smoking status.
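For illustration, the sketch below shows how a Cox proportional hazards model with this predictor set might be fitted using the lifelines library. All column names and the generated data are hypothetical stand-ins; the study's registry data and estimated coefficients are not reproduced here.

```python
# Minimal sketch: Cox proportional hazards model with a
# Framingham-style predictor set, fitted on synthetic data.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(1)
n = 5000
df = pd.DataFrame({
    "sex": rng.integers(0, 2, n),            # 0 = female, 1 = male
    "age": rng.integers(35, 80, n),
    "total_chol": rng.normal(200, 40, n),    # mg/dL
    "hdl_chol": rng.normal(55, 15, n),       # mg/dL
    "systolic_bp": rng.normal(130, 20, n),   # mmHg
    "bp_treatment": rng.integers(0, 2, n),   # antihypertensive drug intake
    "diabetes": rng.integers(0, 2, n),
    "smoker": rng.integers(0, 2, n),
})
# Synthetic follow-up time (years) and event indicator
df["time"] = rng.exponential(10, n).clip(max=12)
df["event"] = (rng.random(n) < 0.1).astype(int)

cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")
cph.print_summary()  # one log hazard ratio per predictor
```

The appeal of this approach, as the highlight above notes, is that the entire fitted model reduces to one coefficient per predictor, so every prediction can be written out as a transparent formula.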
Summary
While machine learning (ML) algorithms may predict cardiovascular outcomes more accurately than statistical models, their result is usually not representable by a transparent formula, so it is often unclear how specific values of predictors lead to the predictions. An important caveat of many ML algorithms is that the final model structure is non-transparent and predictions seem to be generated by a 'black box'. This impedes reproducibility as well as quantification of a particular predictor-risk relation. Several techniques have been proposed to address this [9, 10]; some of them have been denoted 'model-agnostic', as they can be applied without knowing how a modeling algorithm arrives at its predictions.
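One example of such a model-agnostic technique (in addition to the partial dependence and ICE plots used in the paper) is permutation importance: a predictor's values are shuffled and the resulting drop in a performance score is recorded, requiring only access to the fitted model's predictions. A minimal sketch with scikit-learn follows; the data and model are hypothetical stand-ins.

```python
# Minimal sketch: permutation importance as a model-agnostic
# explanation technique, on synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 4))
# Outcome depends on features 0 and 2 only
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(size=1000)) > 0

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each column on held-out data and record the drop in accuracy;
# the model is treated purely as a prediction black box
result = permutation_importance(model, X_test, y_test,
                                n_repeats=10, random_state=0)
for i, (mean, std) in enumerate(zip(result.importances_mean,
                                    result.importances_std)):
    print(f"feature {i}: {mean:.3f} +/- {std:.3f}")
```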