Abstract
Minority oversampling techniques play a pivotal role in imbalanced learning. However, traditional oversampling algorithms can introduce intra-class imbalance, ignore the important information carried by boundary samples, and produce new samples that are highly similar to the existing ones. To address these problems, we propose a new oversampling method, the BIRCH and Boundary Midpoint Centroid Synthetic Minority Over-Sampling Technique (BI-BMCSMOTE). First, the algorithm applies BIRCH clustering to group the minority samples quickly; after identifying and removing noise, it marks the boundary minority samples according to a probability. Second, it builds a density function for each sample cluster, computes the cluster density and the corresponding sampling weight, performs midpoint synthesis between the probabilistically marked boundary samples and the other minority samples within each cluster, and then determines the proportion of synthetic samples so as to improve model accuracy. Experimental results show that the algorithm is effective.
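The abstract describes the pipeline only in prose; below is a minimal, hedged Python sketch of such a pipeline, built on scikit-learn's Birch and NearestNeighbors. The noise rule, the boundary-marking probability, the cluster-density weight, and the function name bi_bmcsmote_sketch are illustrative assumptions, not the authors' exact definitions.

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.neighbors import NearestNeighbors


def bi_bmcsmote_sketch(X, y, minority_label, k=5, n_new=100, seed=0):
    """Illustrative oversampling sketch (not the authors' implementation)."""
    rng = np.random.default_rng(seed)
    X_min = X[y == minority_label]

    # Step 1: BIRCH clustering of the minority class (single scan, CF-tree).
    labels = Birch(threshold=0.5, n_clusters=None).fit_predict(X_min)

    # Step 2: for each minority sample, take the fraction of majority-class
    # points among its k nearest neighbours in the full dataset; use it to
    # drop noise and to mark boundary samples with a probability (assumed rule).
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X_min)
    maj_ratio = (y[idx[:, 1:]] != minority_label).mean(axis=1)
    keep = maj_ratio < 1.0                       # all-majority neighbourhood = noise
    X_min, labels, maj_ratio = X_min[keep], labels[keep], maj_ratio[keep]
    marked = rng.random(len(X_min)) < maj_ratio  # probabilistic boundary marking

    # Step 3: weight each cluster by an assumed density proxy (spread / size),
    # so that sparser clusters receive more synthetic samples.
    cluster_ids = np.unique(labels)
    spreads = np.array([X_min[labels == c].std() + 1e-9 for c in cluster_ids])
    sizes = np.array([(labels == c).sum() for c in cluster_ids])
    weights = (spreads / sizes) / (spreads / sizes).sum()

    # Step 4: midpoint synthesis between a marked boundary sample and another
    # minority sample of the same cluster.
    new_rows = []
    for c, w in zip(cluster_ids, weights):
        members = X_min[labels == c]
        flags = marked[labels == c]
        if len(members) < 2 or not flags.any():
            continue
        for _ in range(int(round(w * n_new))):
            a = members[rng.choice(np.flatnonzero(flags))]
            b = members[rng.integers(len(members))]
            new_rows.append((a + b) / 2.0)       # midpoint of the chosen pair

    if not new_rows:
        return X, y
    X_new = np.array(new_rows)
    return (np.vstack([X, X_new]),
            np.concatenate([y, np.full(len(X_new), minority_label)]))
```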
Highlights
Imbalanced data [1] refers to a dataset in which one or several classes contain far more samples than the other classes
In response to the above problems, this paper proposes a Balanced Iterative Reducing and Clustering Using Hierarchies (BIRCH) and Boundary Midpoint Centroid Synthetic Minority Over-Sampling Technique (BI-BMCSMOTE), which consists of four main steps: BIRCH clustering, marking boundary minority samples according to probability, calculating cluster density to weight the samples of each cluster, and synthesizing new samples proportionally
The BI-BMCSMOTE algorithm is executed in four steps: conduct BIRCH clustering in a single scan of the dataset using a tree structure; calculate the number of samples to generate in each cluster according to the cluster density; identify the boundary minority samples and mark them according to probability; and synthesize new samples proportionally from the marked boundary minority samples and the normal minority samples (see the usage sketch after these highlights)
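As referenced in the last highlight, here is a hedged usage example of the illustrative bi_bmcsmote_sketch function defined after the abstract, applied to a synthetic imbalanced dataset. The dataset parameters, the 5% minority ratio, and the choice of logistic regression are arbitrary stand-ins, not the paper's experimental setup, and the number of generated samples is only approximate.

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an imbalanced benchmark: roughly 5% minority class.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
print("before oversampling:", Counter(y_tr))

# Oversample only the training split, then fit a standard classifier.
X_bal, y_bal = bi_bmcsmote_sketch(X_tr, y_tr, minority_label=1, n_new=1000)
print("after oversampling:", Counter(y_bal))

clf = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
print("minority-class F1 on the untouched test split:",
      f1_score(y_te, clf.predict(X_te)))
```

Oversampling is applied after the train/test split so that the synthetic points cannot leak into the evaluation set, which is the usual way such techniques are assessed.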
Summary
Imbalanced data [1] refers to a dataset in which one or several classes contain far more samples than the other classes. Data mining approaches are widely used to build models and support decision making, but traditional classification models are inefficient when classifying imbalanced data. This is because models built with standard classifiers, such as logistic regression, support vector machines and decision trees, perform poorly and distort some minority samples [2], or because genuine exceptions are mistaken for noise and vice versa [3]. The problems caused by imbalanced data arise in many areas of data mining, such as credit card fraud [4], medical diagnosis [5], network intrusion [6] and oil leakage [7].