Abstract

Oversampling is the most popular data preprocessing technique for making traditional classifiers applicable to imbalanced data. Through an overall review of oversampling techniques (oversamplers), we find that some can be regarded as danger-information-based oversamplers (DIBOs), which create samples near danger areas so that these positive examples can be correctly classified, while others are safe-information-based oversamplers (SIBOs), which create samples near safe areas to increase the precision of positive predictions. However, DIBOs cause the misclassification of too many negative examples in overlapped areas, and SIBOs cause the misclassification of too many borderline positive examples. Based on their respective advantages and disadvantages, a boundary-information-based oversampler (BIBO) is proposed. First, a concept of boundary information that considers safe information and danger information at the same time is introduced, so that the created samples lie near decision boundaries. The experimental results show that DIBOs and BIBO outperform SIBOs on the basic metrics of recall and negative-class precision; SIBOs and BIBO outperform DIBOs on the basic metrics of specificity and positive-class precision; and BIBO is better than both DIBOs and SIBOs in terms of integrated metrics.
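To make the contrast concrete, the sketch below illustrates one plausible boundary-biased oversampling step in Python (assuming NumPy and scikit-learn). It is a hypothetical illustration rather than the paper's exact BIBO procedure, which is detailed in the Procedure section; the function name boundary_biased_oversample and its parameters are invented for this example:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def boundary_biased_oversample(X_min, X_maj, n_new, rng=None):
        """Create n_new synthetic minority samples pulled toward the class boundary."""
        rng = np.random.default_rng(rng)
        # The nearest majority neighbour approximates the local boundary direction.
        nn_maj = NearestNeighbors(n_neighbors=1).fit(X_maj)
        seeds = X_min[rng.integers(0, len(X_min), size=n_new)]
        _, idx = nn_maj.kneighbors(seeds)
        targets = X_maj[idx[:, 0]]
        # Step a random fraction of the gap, capped at 0.5 so each synthetic
        # point stays on the minority side of the midpoint between the seed
        # and the boundary, rather than in a purely safe or danger area.
        gap = rng.uniform(0.0, 0.5, size=(n_new, 1))
        return seeds + gap * (targets - seeds)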

Highlights

  • Data is said to be imbalanced when one of its classes has many more examples than the other classes

  • It can be deduced that the virtual samples created by the boundary-information-based oversampler (BIBO) lie near the real decision nodes of the decision tree

  • The comparison experiments show that safe-information-based oversamplers (SIBOs), which generate synthetic samples near safe areas, improve specificity (spec) and positive-class precision (pre P), and that danger-information-based oversamplers (DIBOs), which generate synthetic samples near dangerous areas, improve recall (rec) and negative-class precision (pre N); a sketch of one common way to separate safe from danger samples follows these highlights
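
One common way to operationalise the safe/danger distinction these highlights refer to is the k-nearest-neighbour rule of Borderline-SMOTE: a minority sample is in danger when at least half of its k neighbours belong to the majority class. The sketch below (Python, assuming NumPy and scikit-learn; the paper may define its own variant) shows this split:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def split_safe_danger(X, y, minority_label, k=5):
        """Split minority samples into safe and danger sets by their k-NN."""
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: first hit is the sample itself
        X_min = X[y == minority_label]
        _, idx = nn.kneighbors(X_min)
        # Count majority-class neighbours among the k true neighbours.
        maj = (y[idx[:, 1:]] != minority_label).sum(axis=1)
        safe = maj < k / 2                    # mostly-minority neighbourhood
        danger = (maj >= k / 2) & (maj < k)   # borderline neighbourhood
        return X_min[safe], X_min[danger]     # maj == k is treated as noise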

Introduction

Data is said to be imbalanced when one of its classes (the majority, or negative, class) has many more examples than the other classes (the minority, or positive, class). Sun et al. [18] turned an imbalanced dataset into multiple balanced sub-datasets and used them to train base classifiers. Another very common type of ensemble learning combines ensembles with resampling techniques, such as SMOTEBagging [19], random balance boost [20], and the synthetic oversampling ensemble [21]. SMOTE_IPE [27] is another combined resampling method: it uses an iterative-partitioning filter [28] to remove noisy samples from both the majority and minority classes, cleaning up the boundaries and making them more regular.
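
For concreteness, the following minimal sketch shows the resampling-plus-ensemble pattern in the spirit of SMOTEBagging, assuming scikit-learn and imbalanced-learn are available. It is illustrative only; the cited methods [19], [20], [21] differ in their sampling schedules and base learners:

    import numpy as np
    from imblearn.over_sampling import SMOTE
    from sklearn.tree import DecisionTreeClassifier

    def smote_bagging_fit(X, y, n_estimators=10, seed=0):
        """Train one decision tree per SMOTE-rebalanced bootstrap sample."""
        rng = np.random.default_rng(seed)
        models = []
        for _ in range(n_estimators):
            boot = rng.integers(0, len(X), size=len(X))  # bootstrap indices
            # Each bag is rebalanced before training; SMOTE needs more minority
            # samples in the bag than its k_neighbors setting (5 by default).
            X_res, y_res = SMOTE(
                random_state=int(rng.integers(1 << 31))
            ).fit_resample(X[boot], y[boot])
            models.append(DecisionTreeClassifier().fit(X_res, y_res))
        return models

    def smote_bagging_predict(models, X):
        """Majority vote over the ensemble (assumes non-negative integer labels)."""
        votes = np.stack([m.predict(X) for m in models])
        return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)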

Oversampling Techniques
Motivation
Boundary
Boundary Information
Procedure for the Boundary-Information-Based Oversampler
Procedure Begin
Strengths Analysis
Experiment
Evaluation
Dataset Description
The Simulated Datasets
The Real-World Datasets
Oversampler Performance Evaluation
Comparative Strengths Results
Performance Results
Comparative Results of Computational Complexity
An Example of Using the Proposed BIBO Method
Conclusions