Ensemble Learning For Imbalanced Data Classification Problem

Pasapitch Chujai, Nittaya Kerdprasop, Kittisak Kerdprasop, Kittipong Chomboon, Pongsakorn Teerarassamee

doi:10.12792/iciae2015.079

Abstract

Imbalanced data occur frequently in real life, for example in medical diagnosis, where records of seriously ill patients are outnumbered by records of healthy ones. Such imbalance degrades the learning performance of data mining algorithms. The decision boundary chosen by most standard machine learning algorithms on imbalanced data tends to be biased toward the majority class, and hence the minority class is misclassified. We therefore present an approach to the imbalanced data classification problem that applies decision tree ensemble learning, using both bagging and boosting, to build models that compensate for misclassification through cost-sensitive learning. In this research, we build model templates from synthetic data with different characteristics, and then choose an appropriate model template for real data with different imbalance ratios and overlapping ratios. The results show that the chosen model template can solve the imbalanced data classification problem efficiently. However, some model templates cannot classify correctly when the imbalance ratio increases.

Highlights

Data mining(1) is a method that has been extensively used to retrieve the hidden knowledge from a large information repository
We perform stratified sampling to draw samples from imbalanced datasets at different imbalance ratios and analyze the characteristics of the data to find a suitable model among the model templates
Imbalanced data classification is a significant challenge for standard algorithms of machine learning
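The stratified sampling step mentioned in the highlights can be illustrated with a minimal sketch. The paper's actual sampling procedure is not given here, so the function name `stratified_sample`, its parameters, and the two-class toy labels are all assumptions made for illustration: each class is treated as a stratum, and the majority stratum is drawn at a chosen multiple of the minority sample size.

```python
import random
from collections import Counter


def stratified_sample(labels, minority_label, n_minority, ratio, seed=0):
    """Draw indices from each class (stratum) separately.

    Hypothetical helper, not the authors' procedure. It assumes a
    two-class problem and returns a sample whose majority:minority
    count ratio is approximately `ratio`.
    """
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority_label]
    majority_idx = [i for i, y in enumerate(labels) if y != minority_label]

    n_majority = round(ratio * n_minority)
    if n_minority > len(minority_idx) or n_majority > len(majority_idx):
        raise ValueError("not enough instances for the requested sample")

    # Sample without replacement inside each stratum, then merge.
    chosen = rng.sample(minority_idx, n_minority) + rng.sample(majority_idx, n_majority)
    return sorted(chosen)


# Toy dataset: 900 majority ("neg") and 100 minority ("pos") labels.
labels = ["neg"] * 900 + ["pos"] * 100
for ratio in (1, 4, 9):
    idx = stratified_sample(labels, "pos", n_minority=50, ratio=ratio)
    counts = Counter(labels[i] for i in idx)
    print(ratio, counts["neg"], counts["pos"])
```

Fixing the minority sample size and varying only the majority size keeps the samples comparable across imbalance ratios; other allocation schemes would serve equally well.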

Summary

Introduction

Data mining(1) is a method that has been used extensively to retrieve hidden knowledge from large information repositories. Most standard data classification algorithms achieve high overall classification accuracy when each class holds an equal proportion of the data. These algorithms show poor learning performance on imbalanced data, in which the group of interest has fewer instances than the other groups(2).

$\text{Imbalance ratio} = \dfrac{N_{\text{majority}}}{N_{\text{minority}}}$ (1)

(b) Lack of data. This problem occurs when the sample size of the minority class is too small(9). A small sample size makes it difficult to find the patterns.
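The gap between overall accuracy and minority-class performance can be made concrete with a small sketch. It assumes the imbalance ratio is the majority class count divided by the minority class count, and it uses a toy dataset and a trivial majority-class predictor; the function names and data are illustrative and not taken from the paper.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Majority class count divided by minority class count (assumed definition)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of `positive` instances that were predicted as `positive`."""
    relevant = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in relevant) / len(relevant)


# Toy medical dataset: 95 healthy records and 5 seriously ill records.
y_true = ["healthy"] * 95 + ["ill"] * 5

# A degenerate classifier that always predicts the majority class.
majority_label = Counter(y_true).most_common(1)[0][0]
y_pred = [majority_label] * len(y_true)

print(imbalance_ratio(y_true))          # 19.0
print(accuracy(y_true, y_pred))         # 0.95
print(recall(y_true, y_pred, "ill"))    # 0.0
```

The predictor reaches 95% accuracy while detecting none of the minority instances, which is why overall accuracy alone is a poor measure on imbalanced data.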

Methods

Results

Conclusion