Abstract

Learning with imbalanced data sets is considered one of the key topics in the machine learning community. Stacking ensembles are efficient algorithms for balanced data sets but have seldom been applied to imbalanced data. In this paper, we propose a novel Re-sample and Cost-Sensitive Stacked Generalization (RECSG) method based on a two-layer learning model. The first step is Level 0 model generalization, including data preprocessing and base-model training. The second step is Level 1 model generalization, involving a cost-sensitive classifier and a logistic regression algorithm. In the learning phase, preprocessing techniques are embedded in the imbalanced-data learning method; in the cost-sensitive algorithm, the cost matrix is derived from both data characteristics and algorithms. The RECSG method thus combines ensemble learning with imbalanced-data techniques. According to experimental results on 17 public imbalanced data sets, as measured by several evaluation metrics (AUC, GeoMean, and AGeoMean), the proposed method achieved better classification performance than other ensemble and single algorithms, and it is especially effective when the performance of the base classifier is low. These results demonstrate that the proposed method can be applied to the class imbalance problem.
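The two-layer idea described above can be sketched with scikit-learn's stacking API: heterogeneous Level 0 base learners feed their cross-validated predictions to a Level 1 logistic regression. This is a minimal illustration, not the exact RECSG configuration; the base models, the toy data set, and the use of `class_weight="balanced"` as a stand-in for the cost-sensitive step are all my own assumptions.

```python
# Minimal two-layer stacked generalization sketch for an imbalanced data set.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Imbalanced toy data: roughly 10% minority class.
X, y = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Level 0: heterogeneous base learners.
# Level 1: logistic regression on their cross-validated predictions,
# with class weighting standing in for a cost-sensitive classifier.
stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier(random_state=0)),
                ("knn", KNeighborsClassifier())],
    final_estimator=LogisticRegression(class_weight="balanced"),
    cv=5,
)
stack.fit(X_tr, y_tr)
y_pred = stack.predict(X_te)
```

Using cross-validated Level 0 predictions (the `cv=5` argument) is what keeps the Level 1 model from overfitting to its base learners' training-set outputs.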

Highlights

  • Classification learning becomes complicated if the class distribution of the data is imbalanced

  • The results showed that the performance of the proposed Re-sample and Cost-Sensitive Stacked Generalization (RECSG) method was the best for 12 of 17 data sets in terms of the geometric mean of sensitivity and specificity (GeoMean) and AGeoMean, and for 10 of 17 data sets in terms of the area under the ROC curve (AUC)

  • In order to solve the class imbalance problem, we proposed the RECSG method based on a two-layer learning model


Introduction

Classification learning becomes complicated if the class distribution of the data is imbalanced. Re-sampling techniques either increase the number of minority-class instances (oversampling) [4] or decrease the number of majority-class instances (undersampling) [5, 6]. Cost-sensitive and algorithm-level approaches are more closely tied to the imbalance problem itself, whereas data-level and ensemble-learning approaches can be used independently of the single classifier. Accuracy is the most popular evaluation metric, but it cannot effectively measure the correct rates of all the classes, so it is not an appropriate metric for imbalanced data sets. For this reason, in addition to accuracy, more suitable metrics should be considered in the imbalanced setting.
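Random oversampling and the GeoMean metric mentioned above can both be written in a few lines of NumPy. The sketch below is illustrative, assuming binary labels in {0, 1}; the function names are my own, not from the paper.

```python
import numpy as np

def random_oversample(X, y, rng=None):
    """Duplicate randomly chosen minority-class rows until both classes
    have the same number of instances."""
    rng = np.random.default_rng(rng)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = counts.max() - counts.min()
    idx = np.flatnonzero(y == minority)
    extra = rng.choice(idx, size=n_extra, replace=True)
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])

def geo_mean(y_true, y_pred):
    """GeoMean = sqrt(sensitivity * specificity), computed from the
    binary confusion matrix (positive class = 1)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return np.sqrt(sensitivity * specificity)
```

Because GeoMean is the geometric mean of the per-class recalls, it stays near zero whenever either class is classified poorly, which is exactly the failure mode that plain accuracy hides on imbalanced data.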
