Abstract

Imbalanced data occur frequently in real-world applications such as medical diagnosis, in which records of seriously ill patients are outnumbered by records of healthy ones. Such imbalance degrades the learning performance of data mining algorithms. On imbalanced data, the decision boundary chosen by most standard machine learning algorithms tends to be biased toward the majority class and therefore misclassifies the minority class. We therefore present an approach to the imbalanced data classification problem that applies decision tree ensemble learning, using both bagging and boosting techniques, to build models that compensate for misclassification through cost-sensitive learning. In this research, we build model templates from synthetic data with different characteristics. We then choose an appropriate model template for real data with different imbalance ratios and overlap ratios. The results show that the chosen model template can solve the imbalanced data classification problem efficiently. However, some model templates cannot classify correctly when the imbalance ratio increases.

Highlights

  • Data mining (1) is a method that has been used extensively to retrieve hidden knowledge from large information repositories

  • We perform stratified sampling to draw samples from imbalanced datasets at different imbalance ratios and analyze the characteristics of the data to find a suitable model among the model templates

  • Imbalanced data classification is a significant challenge for standard machine learning algorithms
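
Drawing samples at a chosen imbalance ratio can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name `sample_at_ratio`, the decision to keep every minority instance, and the toy label list are all assumptions made for the example. Each class is treated as its own stratum, and the majority stratum is subsampled without replacement until the requested ratio is reached.

```python
import random
from collections import Counter


def sample_at_ratio(labels, minority_label, ratio, seed=0):
    """Return sorted indices of a subsample with about `ratio` majority
    instances per minority instance (hypothetical helper).

    Every minority instance is kept; majority instances are drawn
    without replacement from their own stratum.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    # Cap the request at the number of majority instances available.
    n_major = min(len(majority), round(ratio * len(minority)))
    return sorted(minority + rng.sample(majority, n_major))


# Toy data (assumed): 20 minority instances (1) and 980 majority instances (0).
labels = [1] * 20 + [0] * 980
idx = sample_at_ratio(labels, minority_label=1, ratio=10)
counts = Counter(labels[i] for i in idx)
print(counts[0], counts[1])  # 200 20
```

Calling the helper with several values of `ratio` yields a family of datasets whose only intended difference is the degree of imbalance, which is the kind of controlled comparison the highlight describes.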


Summary

Introduction

Data mining (1) is a method that has been used extensively to retrieve hidden knowledge from large information repositories. Most standard data classification algorithms achieve high overall classification accuracy when the classes are present in equal proportions. These algorithms learn poorly, however, when classifying imbalanced data, in which the class of interest has fewer instances than the other classes (2).

$$\frac{N_{majority}}{n_{minority}} \qquad (1)$$

(b) Lack of data. This problem occurs when the number of samples in the minority class is too small (9). A small sample size makes it difficult to find the patterns
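
The gap between overall accuracy and minority-class performance can be made concrete with a small sketch. It assumes the imbalance ratio is the majority count divided by the minority count, and it uses an invented toy dataset; the function names `imbalance_ratio` and `majority_baseline` are hypothetical. The baseline always predicts the majority class, which is the extreme case of a decision boundary biased toward that class.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count (assumed definition)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def majority_baseline(labels, minority_label):
    """Score a classifier that always predicts the most frequent class.

    Returns (overall accuracy, recall on the minority class).
    """
    majority_label = Counter(labels).most_common(1)[0][0]
    predictions = [majority_label] * len(labels)
    correct = sum(p == y for p, y in zip(predictions, labels))
    minority_total = sum(y == minority_label for y in labels)
    minority_hits = sum(
        p == y for p, y in zip(predictions, labels) if y == minority_label
    )
    return correct / len(labels), minority_hits / minority_total


# Toy data (assumed): 950 healthy records and 50 seriously ill records.
labels = ["healthy"] * 950 + ["ill"] * 50
ratio = imbalance_ratio(labels)
accuracy, minority_recall = majority_baseline(labels, minority_label="ill")
print(ratio, accuracy, minority_recall)  # 19.0 0.95 0.0
```

An accuracy of 0.95 alongside a minority recall of 0.0 shows why overall accuracy alone is a poor guide on imbalanced data.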
