ABSTRACT
This paper examines the impact of imbalanced datasets on machine learning (ML) models for malware classification, and whether a disproportionate distribution of malware families affects a model's ability to learn from minority classes. Classification results from ML models trained on an imbalanced dataset were compared against those from models trained on a dataset balanced using the Synthetic Minority Oversampling Technique (SMOTE) and Tomek Links. Four ML models were used: random forest (RF), support vector machine (SVM), decision tree (DT), and k-nearest neighbours (KNN). The models were evaluated on accuracy, precision, and recall. Using the balanced dataset improved precision and recall for minority malware families, with SMOTE and Tomek Links yielding measurable performance gains across most classifiers. For example, RF accuracy for detecting Trojans rose from 95.4% to 97.6%, demonstrating the benefit of removing noisy samples to refine decision boundaries. Although SVM accuracy declined from 93.55% to 84.63%, the improved precision and recall of the other classifiers indicate that the balancing techniques enhanced the models' ability to classify minority samples, reducing the misleadingly high accuracies caused by class imbalance.
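To make the two resampling steps concrete, the following is a minimal NumPy sketch of SMOTE-style interpolation and Tomek Link removal on toy two-class data. The helper names, parameters, and synthetic data are illustrative assumptions, not the paper's implementation (which would typically use a library such as imbalanced-learn).

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples by interpolating between
    each chosen sample and one of its k nearest minority-class neighbours."""
    rng = rng if rng is not None else np.random.default_rng(0)
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]          # k nearest neighbours per sample
    synth = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))           # random minority sample
        j = nn[i, rng.integers(min(k, len(X_min) - 1))]
        synth.append(X_min[i] + rng.random() * (X_min[j] - X_min[i]))
    return np.array(synth)

def tomek_links(X, y):
    """Return indices of majority-class samples that form Tomek links,
    i.e. mutual nearest neighbours belonging to opposite classes."""
    d = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)
    maj = np.bincount(y).argmax()
    return sorted(i for i in range(len(X))
                  if nn[nn[i]] == i and y[i] != y[nn[i]] and y[i] == maj)

# Toy imbalanced dataset: 50 majority vs 5 minority samples (illustrative).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (5, 2))])
y = np.array([0] * 50 + [1] * 5)

# Oversample the minority class up to the majority count, then clean
# the boundary by dropping majority members of Tomek links.
new = smote(X[y == 1], 45, k=3, rng=rng)
Xb = np.vstack([X, new])
yb = np.concatenate([y, np.ones(45, dtype=int)])
keep = np.setdiff1d(np.arange(len(Xb)), tomek_links(Xb, yb))
Xb, yb = Xb[keep], yb[keep]
```

After resampling, both classes are near parity, and the Tomek-Link pass removes only borderline majority samples, which is the boundary-refinement effect the abstract credits for the RF improvement.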