Synthetic Minority Oversampling Technique Research Articles

Imbalanced classification is a common issue in machine learning, particularly when misclassifying minority-class instances carries significant costs. The literature addresses this problem with data-level, algorithm-level, cost-sensitive, and hybrid strategies. This paper introduces a novel method that both improves the ability of classification models to identify patterns and addresses class imbalance while minimizing alterations to the original data distribution. Our proposed framework combines ensemble learning, space partitioning, and the Synthetic Minority Oversampling Technique (SMOTE). The method decomposes the feature space into balanced sub-spaces and then trains an ensemble classifier on these sub-spaces using a bagging approach. In the first step, we develop a Space Partitioning by Metaheuristic algorithm (SPMH) to divide the space into multiple balanced sub-spaces. In the second step, we present Imbalanced Classification by SPMH (ICSPMH) as a solution to imbalanced class problems. ICSPMH runs SPMH multiple times, obtaining a different partition of the space on each run. It then trains a separate classifier for each portion of the space and combines them into an ensemble classifier through bagging. To assess the performance of the proposed framework, we selected 44 well-known datasets and compared ICSPMH with several state-of-the-art approaches. The results demonstrate that ICSPMH outperforms the competing methods and can potentially reduce the oversampling rate to zero. An additional experiment indicated that the choice of metaheuristic algorithm in SPMH does not significantly affect the final performance. The paper also includes a correlation analysis between oversampling rate and final performance, which shows that the framework effectively handles class imbalance with minimal changes to the original dataset. In summary, because ICSPMH changes the data distribution less and sets up local classifiers that improve classification performance, it is a promising method for classifying imbalanced datasets.
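The abstract above names SMOTE as one building block but does not describe it. As a rough illustration only, the following sketch shows the core SMOTE idea of interpolating between a minority sample and one of its nearest minority-class neighbours. It is not the ICSPMH or SPMH procedure, and the function name `smote`, its parameters, and the toy data are assumptions made for this example. Picking the base sample at random is also a simplification of the usual per-sample oversampling rate.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Simplified SMOTE sketch (hypothetical helper, not from the paper).

    Creates n_synthetic points, each lying on the line segment between a
    randomly chosen minority sample and one of its k nearest minority
    neighbours (Euclidean distance).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than samples
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest other minority samples.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy minority class in two dimensions (made-up values).
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(minority, n_synthetic=6, k=2)
print(len(new_points), "synthetic samples generated")
```

Because every synthetic point is a convex combination of two existing minority samples, it stays inside the bounding box of the minority class. This also shows why oversampling alters the original data distribution, which is the effect the framework aims to minimize.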

Abstract

Introduction: Among the Hispanic/Latinx population in the US, four of five leading causes of death are attributed to smoking. Few evidence-based interventions have been developed to improve smoking cessation for this population. Our team has demonstrated the efficacy of a culturally targeted, extended self-help intervention for Spanish-speaking smokers, estimating a smoking abstinence rate of 33% at 24 months, compared to 24% for usual care. Further efforts are needed to enhance this efficacious, low-cost, scalable intervention approach. Machine learning (ML) is one approach to increase understanding of the predictors of treatment outcomes and inform strategies to improve the intervention. Toward that goal, this secondary data analysis used a decision tree (DT) model to predict self-reported 7-day point prevalence abstinence at 18-month follow-up for participants receiving our intervention.

Method: Data from participants who reported smoking status at the 18-month follow-up (N=332 of 714 enrolled, 36% abstinent) were randomly split (80:20) into training and test datasets, stratified by smoking status. We entered demographic, psychosocial, and smoking variables collected at baseline as predictors in the model. In addition to handling missing values and normalizing numeric features, we employed the Synthetic Minority Over-sampling Technique (SMOTE) combined with Tomek links (SMOTE-Tomek) to improve inter-class separability. Recursive Feature Elimination (RFE) using DT was implemented to identify the most relevant predictors. The cross-validation (CV) pipeline incorporated preprocessing, feature selection, class-imbalance handling by over-sampling, and model training with a DT as the interpretable classifier. Hyperparameters were tuned by grid search with stratified K-fold (K=5) CV. F1 score was selected as the primary metric because it accounts for each class's performance by combining precision and recall. Finally, the performance of the classifier was evaluated on the held-out 20% test set.

Results: In CV, the DT classifier achieved an F1 score of 0.73 [95% CI 0.66, 0.77], indicating reasonably good performance in identifying smokers at 18 months. RFE selected familism, age, affect, and confidence as the most relevant predictors of smoking status. Individuals who had lower familism scores (<138) combined with higher age (>50) were more likely to be smokers at 18 months. Individuals who had higher familism scores (>138) combined with higher negative affect (>42) were more likely to be smokers.

Conclusion: This study provides a first step toward personalized care for smoking cessation among the Hispanic/Latinx population and demonstrates the potential of a DT classifier to predict cessation outcomes among individuals who received and completed a culturally and linguistically targeted smoking cessation intervention. Our future analysis will compare this model with conventional statistical modeling approaches and other ML algorithms to identify the best-performing parsimonious model.

Citation Format: Ranjita Poudel, K. Ruwani M. Fernando, Matthew B. Schabath, Steven K. Sutton, Thomas H. Brandon, Issam El Naqa, Vani N. Simmons. A machine learning approach to predicting smoking cessation outcomes among Spanish-speaking smokers who completed a culturally targeted intervention [abstract]. In: Proceedings of the 17th AACR Conference on the Science of Cancer Health Disparities in Racial/Ethnic Minorities and the Medically Underserved; 2024 Sep 21-24; Los Angeles, CA. Philadelphia (PA): AACR; Cancer Epidemiol Biomarkers Prev 2024;33(9 Suppl):Abstract nr B019.
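The abstract above reports F1 as its primary metric. As a small worked illustration of how F1 combines precision and recall, the sketch below computes all three from a pair of label lists. The helper name `precision_recall_f1` and the labels are made up for this example and are not the study's data.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for the given positive class.

    Hypothetical helper written for illustration only.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy labels (made up): 1 = positive class, 0 = negative class.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case there are 3 true positives, 1 false positive, and 1 false negative, so precision, recall, and F1 all equal 0.75. Unlike accuracy, F1 ignores true negatives, which is one reason it is often preferred when classes are imbalanced.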

Articles published on Synthetic Minority Oversampling Technique

Metaheuristic-driven space partitioning and ensemble learning for imbalanced classification

An optimized multi-layer ensemble model for airborne networks intrusion detection

A precise machine learning model: Detecting cervical cancer using feature selection and explainable AI

Research on variety identification of common bean seeds based on hyperspectral and deep learning

COMPARATIVE ANALYSIS OF STATE-OF-THE-ART CLASSIFIERS FOR PARKINSON'S DISEASE DIAGNOSIS

Classifying Legendary Pokémon with SF-Random Forest Algorithm

X-ray Image Analysis for Dental Disease: A Deep Learning Approach Using EfficientNets

Abstract B019: A machine learning approach to predicting smoking cessation outcomes among Spanish-speaking smokers who completed a culturally targeted intervention

Enhancing Security and Performance in Vehicular Adhoc Networks: A Machine Learning Approach to Combat Adversarial Attacks

Feature Engineering for Agile Requirement Management Using Semantic Analysis

Synthetic minority oversampling and iterative fluorescence-suppression integrated algorithm for Raman spectrum pesticide detection system

Influence of Preprocessing Methods of Automated Milking Systems Data on Prediction of Mastitis with Machine Learning Models

A hybrid‐ensemble model for software defect prediction for balanced and imbalanced datasets using AI‐based techniques with feature preservation: SMERKP‐XGB

An intelligent hybrid model for cyber attack classification with selected feature set

Improved phase prediction of high-entropy alloys assisted by imbalance learning

Applying deep learning-based ensemble model to [18F]-FDG-PET-radiomic features for differentiating benign from malignant parotid gland diseases.

Harnessing Machine Learning to Predict Methadone Overdose Risk: Insights from Illinois's SUDORS Data

Optimizing Prehospital Stroke Diagnosis: Integrating Machine Learning with the FAST Scoring System

Employee attrition prediction with convolutional neural network and synthetic minority over-sampling technique

Challenges and opportunities of generative models on tabular data
