Abstract

Imbalanced data can cause problems at the problem-definition level, the algorithm level, and the data level. Several methods have been developed to overcome these problems; one state-of-the-art method is Easy Ensemble. Easy Ensemble is claimed to improve a model's ability to classify the minority class and to overcome the deficiencies of random under-sampling. In this paper we discuss the implementation of Easy Ensemble with Random Forest classifiers to handle the imbalance problem in credit scoring. The combined method is applied to two datasets taken from data science competition websites, finhacks.id and kaggle.com, with majority-to-minority class proportions of 70:30 and 94:6, respectively. The results show that resampling with Easy Ensemble improves Random Forest classifier performance on the minority class. Recall on the minority class increased substantially after resampling: for the first dataset (finhacks.id) it rose from 0.49 to 0.82, and for the second dataset (kaggle.com) it rose from 0.14 to 0.73.

Highlights

  • In real-world problems, imbalanced data are common; for example, in medical applications such as classifying breast cancer type [1], cervical cancer [2], and lung cancer [3]

  • In addition to its high classification accuracy, random forest is regarded as a variable selection tool, which improves the performance of the predictive model [10,11,12]. This approach may be motivated by the robustness of random forest results: the variables selected as important are those used consistently in the splitting rules when they are drawn in the random feature selection that generates a tree for each new bootstrap sample. Considering these studies, we propose the use of random forest for classification in this study

  • We propose the Easy Ensemble method as an imbalanced-learning technique to handle the imbalance problem in classification, with Random Forest as the classifier
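The resampling idea behind Easy Ensemble can be illustrated with a minimal sketch. The code below is not the authors' implementation; it only shows the balanced-subset step, assuming binary labels with the minority class encoded as 1. The function name `easy_ensemble_subsets` and its parameters are hypothetical, chosen for illustration. Each subset keeps every minority sample and adds an equally sized random draw from the majority class; in a full pipeline, one classifier (a Random Forest in this study) would presumably be trained on each subset and the predictions combined.

```python
import random


def easy_ensemble_subsets(labels, n_subsets=10, minority_label=1, seed=0):
    """Build balanced index subsets (hypothetical helper, for illustration).

    Each subset contains all minority indices plus an equally sized random
    sample, drawn without replacement, of the majority indices.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    subsets = []
    for _ in range(n_subsets):
        # Under-sample the majority class down to the minority class size.
        sampled_majority = rng.sample(majority, len(minority))
        subsets.append(minority + sampled_majority)
    return subsets


# Toy labels with the 94:6 proportion reported for the kaggle.com dataset.
labels = [0] * 94 + [1] * 6
subsets = easy_ensemble_subsets(labels, n_subsets=5)
```

Because every subset draws a different random part of the majority class, the ensemble as a whole sees more majority samples than a single random under-sampling pass would, which is the deficiency Easy Ensemble is meant to address.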


Summary

Introduction

In real-world problems, imbalanced data are common; for example, in medical applications such as classifying breast cancer type [1], cervical cancer [2], and lung cancer [3]. In finance, imbalanced data arise in problems such as credit scoring classification [4] and fraud detection [5]. Imbalanced data can cause problems when building a model, because the output of the classification model tends to predict the majority class. The last dataset generated was one with a heavily imbalanced proportion between the two classes, namely 5:95.
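A small worked example shows why a model that leans toward the majority class is a problem. The sketch below assumes a toy set of 100 samples in which the minority class (label 1) makes up 5% of the data, and a degenerate classifier that always predicts the majority class. The helper names `accuracy` and `recall` are illustrative, not taken from the paper.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of truly positive samples that were predicted positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy data: 5 minority samples (label 1) and 95 majority samples (label 0).
y_true = [1] * 5 + [0] * 95
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 100

baseline_accuracy = accuracy(y_true, y_pred)  # 0.95
minority_recall = recall(y_true, y_pred)      # 0.0
```

The classifier reaches 95% accuracy while detecting none of the minority cases, which is why recall on the minority class, rather than accuracy, is the figure reported in the abstract.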
