Background: Stroke is a major global health concern, and risk prediction is essential for its primary prevention. However, uncertainty remains about the optimal methodology for analyzing stroke risk. In this study, we aim to determine the most effective stroke prediction method in a targeted population using machine learning and to establish a general pipeline for future analysis. Method: The training set comprised 70% of the data (n=14491) from the China Stroke Primary Prevention Trial (CSPPT), a randomized, double-blind, multicenter clinical trial. Internal validation used the remaining 30% of the CSPPT data (n=6211), and external validation used a nested case-control (NCC) dataset (n=2568). All analyzed participants were hypertensive adults without a prior history of stroke (n=23270). The primary outcome was first stroke. Four established analysis methods were applied and compared: logistic regression (LR), stepwise logistic regression (SLR), extreme gradient boosting (XGBoost), and random forest (RF). Population characteristic data were analyzed separately with and without laboratory variables. Models were assessed by accuracy, sensitivity, specificity, kappa, and area under the receiver operating characteristic curve (AUC), with AUC the primary criterion. Data balancing techniques, including random under-sampling (RUS) and the synthetic minority over-sampling technique (SMOTE), were applied to the unbalanced training set. Findings: The best model performance was observed for the RF model with RUS applied and laboratory variables included. Compared with null models (sensitivity = 0, specificity = 100, mean AUC = 0·643), data balancing techniques improved overall performance, with RUS showing the more satisfactory effect in the current study (RUS: sensitivity = 63·9; specificity = 53·7; mean AUC = 0·624). Adding laboratory variables improved the performance of the analysis methods. All results were reconfirmed in the validation sets.
The top 10 most important variables were determined using the best-performing analysis method. Interpretation: Among the tested methods, the most effective stroke prediction model in the targeted population was RF with RUS applied. Based on the insights from the current study, we provide a general framework for building machine learning based prediction models. Funding Information: The study was supported by funding from the following: Jiangxi Outstanding Person Foundation [20192BCBL23024]; Key RD the National Natural Science Foundation of China [81960074, 81730019, 81973133]; Jiangxi Provincial Health Commission [202130440]; the National Key Research and Development Program [2016YFE0205400, 2018ZX09739010, 2018ZX09301034003]; the Science and Technology Planning Project of Guangzhou, China [201707020010]; the Science, Technology and Innovation Committee of Shenzhen [JSGG20170412155639040, GJHS20170314114526143, JSGG20180703155802047]; the Economic, Trade and Information Commission of Shenzhen Municipality [20170505161556110, 20170505160926390]; the Research Fund Program of Guangdong Provincial Key Laboratory of Renal Failure Research, Clinical Innovation Research Program of Guangzhou Regenerative Medicine and Health Guangdong Laboratory [2018GZR0201003]. Declaration of Interests: Dr. Xiao Huang reports grants from Jiangxi Outstanding Person Foundation [20192BCBL23024], Key RD Jiangxi Provincial Health Commission [202130440]; Dr. Xiping Xu reports grants from the National Key Research and Development Program [2016YFE0205400, 2018ZX09739010, 2018ZX09301034003], the Department of Science and Technology of Guangdong Province [2020B121202010], the Science and Technology Planning Project of Guangzhou, China [201707020010], the Science, Technology and Innovation Committee of Shenzhen [GJHS20170314114526143, JSGG20180703155802047], the Economic, Trade and Information Commission of Shenzhen Municipality [20170505161556110, 20170505160926390, 201705051617070]. Dr.
Xianhui Qin reports grants from the National Natural Science Foundation of China [81730019, 81973133]; the Research Fund Program of Guangdong Provincial Key Laboratory of Renal Failure Research, Clinical Innovation Research Program of Guangzhou Regenerative Medicine and Health Guangdong Laboratory [2018GZR0201003]. No other disclosures were reported. Ethics Approval Statement: Two data sets with similar baseline characteristics, collected by the same team, were selected and analyzed in our study: the China Stroke Primary Prevention Trial (CSPPT) data set and the nested case-control (NCC) data set, which is a subset of the H-type Hypertension and Stroke Prevention and Control Project (HSPCP). Both studies were approved by the Ethics Committee of the Institute of Biomedicine, Anhui Medical University, Hefei, China, and all participants in both studies provided written informed consent.
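To make the Method terminology concrete, the following is a minimal Python sketch of random under-sampling (RUS) and of the threshold-based assessment metrics named above (accuracy, sensitivity, specificity, kappa). It is an illustration only and not the authors' implementation: the function names `random_under_sample` and `binary_metrics`, the toy labels, and the 5% event rate are all hypothetical, and AUC is omitted because it requires predicted probabilities rather than class labels. The always-negative classifier in the demo is assumed here as one plausible reading of a "null model" on unbalanced data; it reproduces the sensitivity = 0, specificity = 100 pattern reported in the Findings.

```python
import random


def random_under_sample(rows, labels, seed=0):
    """Randomly drop rows until both classes (0 and 1) have the minority-class size."""
    rng = random.Random(seed)
    by_class = {0: [], 1: []}
    for i, y in enumerate(labels):
        by_class[y].append(i)
    n_keep = min(len(by_class[0]), len(by_class[1]))
    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, n_keep))
    kept.sort()  # preserve the original row order
    return [rows[i] for i in kept], [labels[i] for i in kept]


def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity and Cohen's kappa for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    n = tp + tn + fp + fn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # Agreement expected by chance, from the marginal totals.
    p_e = ((tp + fp) / n) * ((tp + fn) / n) + ((tn + fn) / n) * ((tn + fp) / n)
    kappa = (accuracy - p_e) / (1 - p_e) if p_e != 1 else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "kappa": kappa,
    }


# Hypothetical toy data: 100 participants, 5 with the outcome (label 1).
toy_rows = list(range(100))
toy_labels = [1] * 5 + [0] * 95

# A classifier that always predicts "no stroke" looks accurate but finds no cases.
always_negative = binary_metrics(toy_labels, [0] * len(toy_labels))
print(always_negative)

# After RUS the training labels are balanced (5 positives, 5 negatives).
balanced_rows, balanced_labels = random_under_sample(toy_rows, toy_labels, seed=1)
print(len(balanced_rows), sum(balanced_labels))
```

In practice the balanced rows would be used only to fit the model, while the metrics would be computed on the untouched validation sets, consistent with the abstract's description of balancing the training set alone.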