Oversampling Algorithms Research Articles

The subject of the article is classification in machine learning when the classes in a dataset are imbalanced. The purpose of the work is to analyze existing solutions and algorithms for dataset imbalance across different types of data and different industries, and to compare the algorithms experimentally. The article addresses the following tasks: to analyze approaches to the problem (preprocessing methods, learning methods, hybrid methods and algorithmic approaches); to define and describe the oversampling algorithms most often used to balance datasets; to select classification algorithms that serve as a tool for judging the quality of balancing by checking the applicability of the datasets obtained after oversampling; to determine the metrics used to compare classification quality; and to conduct experiments according to the proposed methodology. For clarity, datasets with varying degrees of imbalance were considered: the number of minority-class instances was 15, 30, 45, and 60% of the number of majority-class samples. The following methods are used: analytical and inductive methods for determining the necessary set of experiments and building hypotheses about their results, and experimental and graphic methods for obtaining a visual comparison of the selected algorithms. The following results were obtained: using the quality metrics, an experiment was conducted for all algorithms on two different datasets, the Titanic passenger dataset and a dataset for detecting fraudulent transactions in bank accounts. The results indicated the best applicability for the SMOTE and SVM SMOTE algorithms and the worst performance for Borderline SMOTE and k-means SMOTE, and they also describe the behaviour of each algorithm and its potential uses.
Conclusions: applying the analytical and experimental methods provided a comprehensive comparative description of the existing balancing algorithms. The superiority of oversampling algorithms over undersampling algorithms was demonstrated. The selected algorithms were compared using different classification algorithms. The results are presented in graphs and tables and summarized with heat maps. The conclusions can be used when choosing the optimal balancing algorithm in machine learning.
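The experimental setup described in this abstract can be illustrated with a small sketch. It is not the authors' code: the function names `subsample_minority` and `smote_like`, the toy Gaussian data, and the choice of `k=5` are assumptions made for illustration. The sketch thins the minority class to 15, 30, 45, and 60% of the majority class and then restores balance with a basic SMOTE-style interpolation between a minority point and one of its nearest minority neighbours.

```python
import math
import random


def subsample_minority(majority, minority, ratio, rng):
    """Keep round(ratio * len(majority)) minority rows, e.g. ratio=0.15."""
    n_keep = min(len(minority), round(ratio * len(majority)))
    return rng.sample(minority, n_keep)


def smote_like(minority, n_new, k, rng):
    """Create n_new points by interpolating towards nearby minority points."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position on the segment from base to mate
        synthetic.append(tuple(b + gap * (m - b) for b, m in zip(base, mate)))
    return synthetic


rng = random.Random(0)
# Toy two-dimensional data (an assumption; the article uses real datasets)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
minority_pool = [(rng.gauss(3, 1), rng.gauss(3, 1)) for _ in range(200)]

balanced_sizes = {}
for ratio in (0.15, 0.30, 0.45, 0.60):
    minority = subsample_minority(majority, minority_pool, ratio, rng)
    extra = smote_like(minority, len(majority) - len(minority), k=5, rng=rng)
    balanced_sizes[ratio] = (len(minority), len(minority) + len(extra))
```

Here `balanced_sizes` maps each imbalance ratio to the minority count before and after oversampling. The variants named in the abstract (SVM SMOTE, Borderline SMOTE, k-means SMOTE) differ mainly in how they pick the base points and neighbours, which this sketch does not model.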

Labelled imbalanced data, used for classification problems, have an unequal distribution of samples over the classes. Traditional classification models, such as random forest and gradient boosting, struggle with imbalanced datasets. Over 85 oversampling algorithms, mostly extensions of the SMOTE algorithm, have been built over the past two decades to address this problem. However, previous studies have shown that different oversampling algorithms have different degrees of efficiency with different classifiers. With numerous algorithms available, it is difficult to decide on an oversampling algorithm for a chosen classifier. Here, we overcome this problem with a multi-schematic and classifier-independent oversampling approach, referred to as ProWRAS (Proximity Weighted Random Affine Shadowsampling). ProWRAS integrates the Localized Random Affine Shadowsampling (LoRAS) algorithm and the Proximity Weighted Synthetic oversampling (ProWSyn) algorithm. By controlling the variance of the synthetic samples and using a proximity-weighted clustering system for the minority class data, the ProWRAS algorithm improves performance compared to algorithms that generate synthetic samples by modelling high-dimensional convex spaces of the minority class. ProWRAS is multi-schematic: it employs four oversampling schemes, each of which models the variance of the generated data in its own way. The proximity-weighted clustering approach of ProWRAS allows one to generate low-variance synthetic samples only in borderline clusters, to avoid overlap with the majority class. Most importantly, the performance of ProWRAS, with a proper choice of oversampling scheme, is independent of the classifier used. We have benchmarked our newly developed ProWRAS algorithm against five state-of-the-art oversampling models and four different classifiers on 20 publicly available datasets. Our results show that ProWRAS outperforms other oversampling algorithms in a statistically significant way, in terms of both F1-score and $\kappa$-score. Moreover, we have introduced a novel measure of classifier independence, the $\mathcal{J}$-score, and showed quantitatively that ProWRAS performs better, independent of the classifier used. Thus, ProWRAS is highly effective for homogeneous tabular data where convex modelling of the data space can be done. In practice, ProWRAS customizes synthetic sample generation according to a classifier of choice and thereby reduces benchmarking efforts.
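The idea of generating synthetic samples as random affine combinations of shadowsamples can be sketched roughly as follows. This is a loose illustration under stated assumptions, not the ProWRAS or LoRAS implementation: the function name `shadow_affine_sample` and all of its parameters are hypothetical, shadowsamples are taken to be Gaussian-perturbed copies of a point's minority neighbourhood, and the combination weights are drawn as normalized Gamma(1, 1) values (non-negative and summing to one). The proximity-weighted clustering and the four oversampling schemes described in the abstract are not modelled.

```python
import math
import random


def shadow_affine_sample(point, minority, k, n_shadow, sigma, n_combine, rng):
    """Return one synthetic point built from noisy copies of a neighbourhood."""
    # Neighbourhood: the k + 1 minority points closest to `point`
    # (this includes `point` itself when it belongs to `minority`)
    neighbourhood = sorted(minority, key=lambda p: math.dist(point, p))[: k + 1]
    # Shadowsamples: n_shadow noisy copies of each neighbourhood member
    shadows = [
        tuple(x + rng.gauss(0.0, sigma) for x in p)
        for p in neighbourhood
        for _ in range(n_shadow)
    ]
    # Pick a random subset of shadowsamples to combine
    chosen = rng.sample(shadows, n_combine)
    # Random weights: non-negative and normalized to sum to one
    raw = [rng.gammavariate(1.0, 1.0) for _ in chosen]
    total = sum(raw)
    weights = [w / total for w in raw]
    return tuple(
        sum(w * s[d] for w, s in zip(weights, chosen)) for d in range(len(point))
    )


rng = random.Random(0)
minority = [(rng.gauss(3, 1), rng.gauss(3, 1)) for _ in range(30)]
synthetic = [
    shadow_affine_sample(p, minority, k=5, n_shadow=4, sigma=0.05, n_combine=6, rng=rng)
    for p in minority
]
```

In this sketch, `sigma` and `n_combine` together control how far a synthetic point can spread: smaller noise and more averaged shadowsamples give lower-variance output, which is the kind of control the abstract attributes to the different oversampling schemes.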

Articles published on Oversampling Algorithms

MLAWSMOTE: Oversampling in Imbalanced Multi-label Classification with Missing Labels by Learning Label Correlation Matrix

Natural local density-based adaptive oversampling algorithm for imbalanced classification

A Synthetic Minority Oversampling Technique Based on Gaussian Mixture Model Filtering for Imbalanced Data Classification.

Improved Oversampling Algorithm for Imbalanced Data Based on K-Nearest Neighbor and Interpolation Process Optimization

An overlapping minimization-based over-sampling algorithm for binary imbalanced classification

Attention features selection oversampling technique (AFS-O) for rolling bearing fault diagnosis with class imbalance

COMPARISON OF DATASET OVERSAMPLING ALGORITHMS AND THEIR APPLICABILITY TO THE CATEGORIZATION PROBLEM

Irregular characteristic analysis of 3D particles—A novel virtual sieving technique

A Novel Approach to Decision-Making on Diagnosing Oncological Diseases Using Machine Learning Classifiers Based on Datasets Combining Known and/or New Generated Features of a Different Nature

Subspace-based minority oversampling for imbalance classification

Feature-Ensemble-Based Crop Mapping for Multi-Temporal Sentinel-2 Data Using Oversampling Algorithms and Gray Wolf Optimizer Support Vector Machine

METAbolomics data Balancing with Over-sampling Algorithms (META-BOA): an online resource for addressing class imbalance.

SW: A weighted space division framework for imbalanced problems with label noise

Seabed Modelling by Means of Airborne Laser Bathymetry Data and Imbalanced Learning for Offshore Mapping.

SA-CGAN: An oversampling method based on single attribute guided conditional GAN for multi-class imbalanced learning

Improved CBSO: A distributed fuzzy-based adaptive synthetic oversampling algorithm for imbalanced judicial data

A Multi-Schematic Classifier-Independent Oversampling Approach for Imbalanced Datasets

An Improving Majority Weighted Minority Oversampling Technique for Imbalanced Classification Problem

Integrating Second-order Moving Average and Over-sampling Algorithm to Predict Apoptosis Protein Subcellular Localization

Identification of Orphan Genes in Unbalanced Datasets Based on Ensemble Learning.
