Abstract

Background
Sudden cardiac death (SCD) is mainly due to lethal ventricular arrhythmias and often occurs in patients with ischemic heart disease. SCD and ischemic heart disease share a similar pattern of cardiovascular risk factors (CVRF), and existing risk scores have focused on cardiac-related risk factors. We aimed to investigate whether non-CVRF variables could improve the prediction of SCD beyond standard CVRF.

Method
We compared two strategies for variable inclusion. First, we built a prediction model restricted to electronic health record (EHR) data on cardiovascular diseases and risk factors (CVD MR model) recorded up to 5 years before SCD. We selected all available main cardiovascular variables as surrogate markers for coronary artery disease, stroke, diabetes, hypertension, smoking status, obesity, lipid disorders and chronic renal failure. Second, our global approach (ALL MR model) included all medical record codes, without any prior selection. To estimate the risk of SCD over three months, we trained a machine learning model on EHR data representing 8,566,229 drug prescriptions and 801,352 hospital diagnoses recorded up to 5 years before SCD. The data were obtained from a cohort of 12,338 SCD cases and 12,338 controls in France from 2011 to 2015. We then validated the results on two external cohorts: a temporal cohort from the same area between 2016 and 2020, with 11,620 SCD cases and 11,620 controls, and a geographical cohort from the USA from 2013 to 2021, with 892 SCD cases and 892 controls.

Results
The CVD MR + ML model (CatBoost algorithm) yielded moderate performance, with an AUC of 0.68 (95% CI 0.65-0.68), a sensitivity of 45% (95% CI 42-47) and a specificity of 82% (95% CI 80-84). The EHR model with all medical records (ALL MR + ML model) performed better. In the derivation cohort, this model achieved an AUC of 0.80 (95% CI 0.78-0.82), a sensitivity of 67% (95% CI 64-69) and a specificity of 80% (95% CI 78-82) (Figure 1).
The logistic regression model applied to all medical records (ALL MR + LR) outperformed the CatBoost algorithm restricted to CVD medical records (CVD MR + ML) (Figure 1).

Conclusion
Including all available variables, beyond the usual CVRF, significantly improves the performance of SCD prediction models, regardless of the modelling method (machine learning or logistic regression).

Figure 1: Receiver operating characteristic (ROC) curves of the different models (ALL MR + ML, ALL MR + LR, CVD MR + ML and CVD MR + LR) in the derivation cohort and the two validation cohorts (temporal and geographical).
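The discrimination metrics reported in the Results (AUC, sensitivity and specificity, each with a 95% CI) can be illustrated with a minimal sketch. The abstract does not state how the intervals were computed or which decision threshold was used, so the percentile bootstrap and the 0.5 threshold below are assumptions, as are the function names. The labels and scores are toy values, not study data.

```python
import random

def auc(y_true, scores):
    """AUC as the fraction of (case, control) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def sens_spec(y_true, scores, threshold=0.5):
    """Sensitivity and specificity when scores >= threshold are called positive."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

def bootstrap_ci(y_true, scores, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI (assumed method): resample subjects with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # a resample needs both cases and controls
        stats.append(metric(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi

# Toy example: 1 = SCD case, 0 = control; scores stand in for model-predicted risk.
y = [1, 1, 1, 0, 0, 0, 1, 0]
s = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.7, 0.1]

point_auc = auc(y, s)                      # 15 of 16 pairs ranked correctly: 0.9375
sensitivity, specificity = sens_spec(y, s) # 0.75 and 0.75 at threshold 0.5
auc_lo, auc_hi = bootstrap_ci(y, s, auc)
print(f"AUC {point_auc:.2f} (95% CI {auc_lo:.2f}-{auc_hi:.2f})")
print(f"sensitivity {sensitivity:.0%}, specificity {specificity:.0%}")
```

The pairwise AUC is quadratic in cohort size and is only meant to make the definition explicit; on cohorts of the size described here, a rank-based implementation from a statistics library would be the practical choice.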