Abstract

Oversampling is a promising preprocessing technique for imbalanced datasets that generates new minority instances to balance the class distribution. However, improperly generated minority instances, i.e., noise instances, may interfere with the learning of the classifier and degrade its performance. In this paper, we therefore propose a simple and effective oversampling approach, ASN-SMOTE, based on k-nearest neighbors and the synthetic minority oversampling technique (SMOTE). ASN-SMOTE first filters noise in the minority class by checking whether the nearest neighbor of each minority instance belongs to the minority or the majority class. ASN-SMOTE then uses the nearest majority instance of each minority instance to perceive the decision boundary, inside which qualified minority neighbors are selected adaptively for each minority instance by the proposed adaptive neighbor selection scheme to synthesize new minority instances. To substantiate its effectiveness, ASN-SMOTE has been applied to three different classifiers, and comprehensive experiments have been conducted on 24 imbalanced benchmark datasets. ASN-SMOTE is also compared extensively with nine notable oversampling algorithms. The results show that ASN-SMOTE achieves the best results on the majority of the datasets. The ASN-SMOTE implementation is available at: https://www.github.com/yixinkai123/ASN-SMOTE/.
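Based on the description above, the three steps (noise filtering, boundary-aware adaptive neighbor selection, and SMOTE-style interpolation) can be sketched roughly as follows. The function name, its arguments, and the exact neighbor rules are illustrative assumptions drawn from the abstract, not the authors' reference implementation (see their repository for that).

```python
import numpy as np

def asn_smote(X_min, X_maj, n_new, rng=None):
    """Illustrative sketch of the ASN-SMOTE steps described in the abstract."""
    rng = np.random.default_rng(rng)

    # Step 1: noise filtering -- a minority instance is kept only if its
    # nearest neighbour among all instances is also a minority instance.
    X_all = np.vstack([X_min, X_maj])
    labels = np.array([1] * len(X_min) + [0] * len(X_maj))
    kept = []
    for i, x in enumerate(X_min):
        d = np.linalg.norm(X_all - x, axis=1)
        d[i] = np.inf                        # exclude the instance itself
        if labels[np.argmin(d)] == 1:        # nearest neighbour is minority
            kept.append(x)
    X_kept = np.array(kept)

    # Step 2: for each retained instance, the distance to its nearest majority
    # instance acts as a boundary radius; only minority neighbours inside that
    # radius qualify for synthesis (adaptive neighbor selection).
    pools = []
    for x in X_kept:
        r = np.linalg.norm(X_maj - x, axis=1).min()
        d = np.linalg.norm(X_kept - x, axis=1)
        candidates = X_kept[(d > 0) & (d < r)]
        if len(candidates):
            pools.append((x, candidates))

    # Step 3: SMOTE-style interpolation between a seed instance and one of
    # its qualified neighbours, repeated until n_new instances are created.
    synthetic = []
    for _ in range(n_new):
        if not pools:
            break                            # no seed has a qualified neighbour
        x, candidates = pools[rng.integers(len(pools))]
        nb = candidates[rng.integers(len(candidates))]
        synthetic.append(x + rng.random() * (nb - x))
    return np.array(synthetic)
```

Because each synthetic point is an interpolation between a retained minority seed and a minority neighbor lying inside the seed's boundary radius, the sketch never synthesizes from filtered noise instances or across the perceived decision boundary.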

Highlights

  • Class-imbalanced data occur often in machine learning: the class distribution in binary or multi-class classification problems is significantly skewed

  • The first rank is assigned to the best-performing oversampling technique and the eighth rank to the worst-performing one

  • We find that random oversampling performs worst on all three measures when k-nearest neighbors (KNN) is used as the classifier

Introduction

Class-imbalanced data occur often in machine learning: the class distribution in binary or multi-class classification problems is significantly skewed. In a binary classification problem, the majority class contains a large number of instances, while the minority class contains only a few [42]. Such problems often arise in practical applications such as bank fraudulent transaction detection [36], credit risk assessment [34], text classification [41], biomedical diagnosis [2,59] and firewall intrusion detection [5]. Owing to the prevalence of imbalanced datasets in practice and the difficulty traditional classifiers have in dealing with them, learning from class-imbalanced data has attracted the attention of many prominent researchers over the last 20 years [31], and many preprocessing methods have been put forward to deal with class imbalance.

