COMPARISON OF STATISTICAL AND MACHINE LEARNING APPROACHES IN A SURVEY FROM THE EPIDEMIOLOGY OF NON-COMMUNICABLE DISEASES

Vladimir Kutsenko, Svetlana Shalnova, Ksenia Pereverdieva, Elena Yarovaya, Asiia Imaeva, Yulia Balanova

doi:10.1097/01.hjh.0000745448.89364.7a

Abstract

Objective: To determine whether forgoing basic machine learning approaches causes a significant loss of predictive power in an epidemiological study.
Design and method: The cross-sectional study «Epidemiology of Cardiovascular Diseases in the Russian Federation (ESSE-RF)» was performed in 13 regions in 2012–2013. ESSE-RF included 21768 randomly selected participants aged 25–64, with a response rate > 80 %. Standard epidemiological methods and criteria were used. The current sub-study included 13912 participants. We chose lasso regression (L1Regression) and random forest (RF) as basic machine learning methods and logistic regression (LogRegression) as a basic statistical method. We compared these algorithms on predicting arterial hypertension (AH) from 18 clinical, demographic, and social risk factors taken from international AH guidelines. We fitted the models on a training sample and assessed their quality on a holdout sample using the AUC metric. We used 1000 random train-holdout splits so that the results did not depend on any single split. We studied the RF feature impacts with model-agnostic techniques: partial dependence plots (PDPs), feature importances, and feature interaction measures. Statistical analysis was performed using R 3.6.1.
Results: LogRegression performed better (AUC 82.11 ± 0.42) than L1Regression (81.90 ± 0.40) and RF (81.60 ± 0.43). Nine of the ten most important variables in LogRegression were also among the ten most important variables in RF, and vice versa. However, RF feature importance was more consistent with the risk guidelines than that of LogRegression (fig. 1A). In particular, LDL and HDL levels were insignificant in LogRegression but significant in RF. The PDPs of RF were strictly monotonic and close to linear (fig. 1B). No feature except age had an interaction measure greater than 35 %. Using the information from this analysis, we constructed new interpretable features, which improved the AUC of basic LogRegression by 0.21.
Conclusions: Data from the ESSE-RF study are homogeneous. Associations between features and AH are approximately linear and non-interactive. Therefore, it is appropriate to adhere to basic interpretable statistical algorithms. However, machine learning methods can provide additional information that improves understanding of the influence of risk factors.
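
The evaluation protocol described under Design and method (fit on a training sample, score a holdout sample by AUC, repeat over many random splits, report mean ± SD) might be sketched as follows. This is only an illustration under stated assumptions, not the authors' code: the original analysis was performed in R, whereas this sketch uses plain Python; the data are synthetic with two made-up features; the 30 % holdout fraction, the gradient-descent settings, and all function names are hypothetical; and 20 splits are used instead of 1000 to keep the run short.

```python
import math
import random
import statistics


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def auc(scores, labels):
    """Rank-based AUC (Mann-Whitney U); tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def fit_logistic(X, y, lr=0.5, epochs=100):
    """Unpenalized logistic regression fitted by batch gradient descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)  # last entry is the intercept
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            err = predict(w, xi) - yi
            for j in range(d):
                grad[j] += err * xi[j]
            grad[-1] += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def predict(w, x):
    """Predicted probability of the positive class for one observation."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + w[-1])


def make_data(n, rng):
    """Synthetic data with a linear, non-interacting effect of two features."""
    X, y = [], []
    for _ in range(n):
        x = [rng.gauss(0, 1), rng.gauss(0, 1)]
        X.append(x)
        y.append(1 if rng.random() < sigmoid(1.5 * x[0] + 0.5 * x[1]) else 0)
    return X, y


def repeated_holdout_auc(X, y, n_splits=20, holdout_frac=0.3, seed=0):
    """Mean and SD of the holdout AUC over repeated random train-holdout splits."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    aucs = []
    for _ in range(n_splits):
        rng.shuffle(idx)
        cut = int(len(idx) * (1 - holdout_frac))
        train, hold = idx[:cut], idx[cut:]
        w = fit_logistic([X[i] for i in train], [y[i] for i in train])
        scores = [predict(w, X[i]) for i in hold]
        aucs.append(auc(scores, [y[i] for i in hold]))
    return statistics.mean(aucs), statistics.stdev(aucs)


X, y = make_data(300, random.Random(42))
mean_auc, sd_auc = repeated_holdout_auc(X, y)
print(f"holdout AUC: {100 * mean_auc:.2f} ± {100 * sd_auc:.2f}")
```

To compare several models in this way, the same list of splits would be reused for each model so that differences in mean AUC reflect the models rather than the splits.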
