COMPARISON OF STATISTICAL AND MACHINE LEARNING APPROACHES IN A SURVEY FROM THE EPIDEMIOLOGY OF NON-COMMUNICABLE DISEASES

Kutsenko Vladimir, Asiia Imaeva, Svetlana Shalnova, Elena Yarovaya, Ksenia Pereverdieva, Yulia Balanova

doi:10.1097/01.hjh.0000745448.89364.7a

Abstract

Objective: To determine whether forgoing basic machine learning approaches causes a significant loss of predictive power in an epidemiological study.

Design and method: The cross-sectional study «Epidemiology of Cardiovascular Diseases in the Russian Federation (ESSE-RF)» was performed in 13 regions in 2012–2013. ESSE-RF included 21768 randomly selected participants aged 25–64, with a response rate > 80%. Standard epidemiological methods and criteria were used. The current sub-study included 13912 participants. We chose lasso regression (L1Regression) and random forest (RF) as basic machine learning methods and logistic regression (LogRegression) as a basic statistical method. We compared these algorithms on predicting arterial hypertension (AH) from 18 clinical, demographic, and social risk factors taken from international AH guidelines. We fitted the models on a training sample and assessed their quality on a holdout sample using the AUC metric. To average out the randomness of any single split, we repeated this over 1000 random train-holdout splits. We studied the RF feature impacts with model-agnostic techniques: partial dependence plots (PDPs), feature importances, and feature interaction measures. Statistical analysis was performed using R 3.6.1.

Results: LogRegression achieved a higher AUC (82.11 ± 0.42) than L1Regression (81.90 ± 0.40) and RF (81.60 ± 0.43). Nine of the ten most important variables in LogRegression were also among the ten most important variables in RF. However, RF feature importance was more consistent with the risk guidelines than that of LogRegression (fig. 1A). In particular, LDL and HDL levels were insignificant in LogRegression but significant in RF. The PDPs of RF were strictly monotonic and close to linear (fig. 1B). No feature except age had an interaction measure greater than 35%. Using the information from this analysis, we constructed new interpretable features, which improved the AUC of basic LogRegression by 0.21.
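The evaluation protocol described above (fit on a training sample, score AUC on a holdout sample, repeat over many random splits, report mean ± SD) can be sketched as follows. This is a minimal illustration, not the authors' code: the study used R 3.6.1, whereas this sketch is plain Python; the data are synthetic; the helper names (`auc`, `fit_logistic`, `repeated_holdout_auc`) are hypothetical; only plain and L1-penalised logistic regression are shown (random forest is omitted); and the number of splits is reduced from 1000 to 10 to keep it fast.

```python
import math
import random
import statistics

def auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, l1=0.0, lr=0.1, epochs=100):
    """Batch gradient descent; l1 > 0 adds a soft-thresholding (lasso) step."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        b -= lr * grad_b / n
        for j in range(d):
            wj = w[j] - lr * grad_w[j] / n
            # Proximal step: shrink each weight towards zero by lr * l1.
            w[j] = math.copysign(max(abs(wj) - lr * l1, 0.0), wj)
    return w, b

def predict(model, X):
    w, b = model
    return [sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) for xi in X]

def make_data(n, rng):
    """Synthetic stand-in for survey data: the outcome depends linearly on two of three features."""
    X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(n)]
    y = [1 if rng.random() < sigmoid(1.5 * x[0] - 1.0 * x[1]) else 0 for x in X]
    return X, y

def repeated_holdout_auc(X, y, l1, n_splits, holdout_frac, rng):
    """Mean and SD of the holdout AUC over repeated random train-holdout splits."""
    aucs = []
    idx = list(range(len(X)))
    for _ in range(n_splits):
        rng.shuffle(idx)
        cut = int(len(idx) * holdout_frac)
        hold, train = idx[:cut], idx[cut:]
        model = fit_logistic([X[i] for i in train], [y[i] for i in train], l1=l1)
        aucs.append(auc(predict(model, [X[i] for i in hold]), [y[i] for i in hold]))
    return statistics.mean(aucs), statistics.stdev(aucs)

rng = random.Random(0)
X, y = make_data(300, rng)
for name, l1 in [("logistic", 0.0), ("lasso-logistic", 0.01)]:
    mean_auc, sd_auc = repeated_holdout_auc(X, y, l1, n_splits=10, holdout_frac=0.3, rng=rng)
    print(f"{name}: AUC = {100 * mean_auc:.2f} +/- {100 * sd_auc:.2f}")
```

Reporting the spread across splits alongside the mean makes it possible to judge whether a gap between two models is large relative to split-to-split variation.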
Conclusions: Data from the ESSE-RF study are homogeneous. Associations between features and AH are approximately linear and non-interactive. Therefore, it is appropriate to rely on basic interpretable statistical algorithms. However, machine learning methods can provide additional information that improves understanding of the influence of risk factors.
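The partial dependence check behind the "approximately linear" conclusion can be illustrated with a short sketch. It assumes only a black-box prediction function; `toy_model`, its coefficients, and the synthetic rows are hypothetical stand-ins, not the fitted RF or the ESSE-RF data.

```python
import math
import random

def partial_dependence(predict_fn, X, feature, grid):
    """Average prediction over the data with one feature forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in X:
            modified = list(row)
            modified[feature] = value  # override the feature, keep the rest as observed
            total += predict_fn(modified)
        curve.append(total / len(X))
    return curve

def is_monotonic(curve):
    """True if the curve never changes direction."""
    diffs = [b - a for a, b in zip(curve, curve[1:])]
    return all(d >= 0 for d in diffs) or all(d <= 0 for d in diffs)

def toy_model(row):
    # Hypothetical black box: risk rises with age (feature 0) and with a second feature.
    age, other = row
    return 1.0 / (1.0 + math.exp(-(0.05 * (age - 45) + 0.8 * other)))

rng = random.Random(1)
X = [[rng.uniform(25, 64), rng.gauss(0, 1)] for _ in range(200)]
grid = list(range(25, 65, 5))  # ages 25, 30, ..., 60

curve = partial_dependence(toy_model, X, feature=0, grid=grid)
for age, value in zip(grid, curve):
    print(f"age={age}: mean predicted risk={value:.3f}")
print("monotonic:", is_monotonic(curve))
```

Because the function only calls `predict_fn`, the same procedure applies to any fitted model, which is what makes the technique model-agnostic.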

Journal: Journal of Hypertension