Construction and validation of risk prediction models for pulmonary embolism in hospitalized patients based on different machine learning methods.

Tao Huang,Zhihai Huang,Xiaodong Peng,Lingpin Pang,Jie Sun,Jinbo Wu,Jinman He,Kaili Fu,Jun Wu,Xishi Sun

doi:10.3389/fcvm.2024.1308017

Abstract

This study aims to apply different machine learning (ML) methods to construct risk prediction models for pulmonary embolism (PE) in hospitalized patients, and to evaluate and compare the predictive efficacy and clinical benefit of each model. We conducted a retrospective study involving 332 participants (172 PE positive cases and 160 PE negative cases) recruited from Guangdong Medical University. Participants were randomly divided into a training group (70%) and a validation group (30%). Baseline data were analyzed using univariate analysis, and potential independent risk factors associated with PE were further identified through univariate and multivariate logistic regression analysis. Six ML models, namely Logistic Regression (LR), Decision Tree (DT), Random Forest (RF), Naive Bayes (NB), Support Vector Machine (SVM), and AdaBoost were developed. The predictive efficacy of each model was compared using the receiver operating characteristic (ROC) curve analysis and the area under the curve (AUC). Clinical benefit was assessed using decision curve analysis (DCA). Logistic regression analysis identified lower extremity deep venous thrombosis, elevated D-dimer, shortened activated partial prothrombin time, and increased red blood cell distribution width as potential independent risk factors for PE. Among the six ML models, the RF model achieved the highest AUC of 0.778. Additionally, DCA consistently indicated that the RF model offered the greatest clinical benefit. This study developed six ML models, with the RF model exhibiting the highest predictive efficacy and clinical benefit in the identification and prediction of PE occurrence in hospitalized patients.

Full Text