Abstract

Imbalanced datasets are pervasive in classification tasks and degrade the performance of classifiers in predicting minority samples. Oversampling is an effective way to address the class imbalance problem. However, existing oversampling methods generally introduce noise examples into the original dataset, especially when the dataset contains class-overlapping regions. In this study, a new oversampling method named Constrained Oversampling is proposed to reduce noise generation during oversampling. The algorithm first extracts the overlapping regions in the dataset. Second, Ant Colony Optimization is applied to define the boundaries of the minority regions. Third, oversampling under constraints is employed to synthesize new samples and obtain a balanced dataset. Our proposal distinguishes itself from other techniques by incorporating constraints into the oversampling process to inhibit noise generation. Experiments show that it outperforms various benchmark oversampling approaches. The effectiveness of the method is explained by studying the impact of class overlapping on imbalanced learning.

Highlights

  • The class imbalance problem occurs when some of the classes in a dataset have significantly more samples than the others

  • We introduce an oversampling method that distinguishes itself from other oversampling techniques by incorporating constraints in the oversampling process to inhibit noise generation in overlapping regions

  • This technique is executed in three successive steps: extract overlapping regions based on the KNN algorithm; define boundary samples for each minority instance using ant colony optimization (ACO); synthesize the required amount of minority samples under constraints
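
The first of the three steps can be illustrated with a minimal sketch. The highlights only state that overlapping regions are extracted "based on the KNN algorithm"; the exact rule is not given here, so the criterion below (a sample belongs to an overlapping region if any of its k nearest neighbours carries a different class label) is an assumption made for illustration, and the function name `overlapping_indices` is hypothetical.

```python
import math


def overlapping_indices(points, labels, k=5):
    """Return indices of samples with at least one other-class point among
    their k nearest neighbours (assumed overlap criterion, not the paper's)."""
    overlap = []
    for i, p in enumerate(points):
        # Indices of all other samples, sorted by Euclidean distance to p.
        nearest = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )[:k]
        # Flag the sample if its neighbourhood mixes classes.
        if any(labels[j] != labels[i] for j in nearest):
            overlap.append(i)
    return overlap


if __name__ == "__main__":
    # Two well-separated clusters plus one mixed pair in the middle.
    points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (5, 5), (5.2, 5)]
    labels = [0, 0, 0, 1, 1, 1, 0, 1]
    print(overlapping_indices(points, labels, k=2))
```

In this toy example only the two samples in the middle are flagged; the later steps (ACO-based boundary definition and constrained synthesis) are not sketched because their details are not given in this summary.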

Summary

INTRODUCTION

The class imbalance problem occurs when some of the classes in a dataset have significantly more samples than the others. The Synthetic Minority Oversampling Technique (SMOTE), proposed by Chawla et al., forms new minority samples by linearly interpolating between minority samples that lie close to each other in feature space [33]. Despite its effectiveness, this method has a major shortcoming: it generates new samples for minority class examples without considering the distribution of the original data, so noise samples are sometimes added to the dataset. One improved variant, Borderline-SMOTE, comes in two versions: Borderline-SMOTE1, which only generates samples among borderline instances belonging to the minority class, and Borderline-SMOTE2, which synthesizes examples between minority samples on the boundary and their nearest negative neighbors. Another improved oversampling method is Adaptive Synthetic Sampling (ADASYN), presented by He et al. [35]. When existing oversampling methods are applied to class-overlapping regions, noise minority samples that fall into the majority region are introduced into the dataset. These noise samples are detrimental to the performance of classification algorithms.
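
The linear interpolation used by SMOTE can be sketched as follows. This is a simplified illustration rather than the reference implementation: each synthetic point is computed as x_new = x + λ (x_nn − x), where x is a randomly chosen minority sample, x_nn is one of its k nearest minority neighbours, and λ is drawn uniformly from [0, 1). The function name `smote_like_samples` and its parameters are hypothetical.

```python
import math
import random


def smote_like_samples(minority, k=5, n_new=10, seed=0):
    """Synthesize n_new points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base point, excluding itself.
        nearest = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(nearest)]
        lam = rng.random()  # interpolation weight in [0, 1)
        # The new point lies on the segment between base and other.
        synthetic.append(tuple(b + lam * (o - b) for b, o in zip(base, other)))
    return synthetic


if __name__ == "__main__":
    minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
    for point in smote_like_samples(minority, k=2, n_new=3):
        print(point)
```

Note that the sketch never inspects majority samples: any majority points lying on the segment between two minority samples are ignored, which is how synthetic points can land in a majority region when the classes overlap.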

A NEW OVERSAMPLING METHOD
DATASETS
EVALUATION METRICS AND PARAMETERS
DISCUSSION
Findings
CONCLUSION