Modeling of class imbalance using an empirical approach with spambase dataset and random forest classification

Kiranmayi Kotipalli, Shan Suthaharan

doi:10.1145/2656434.2656442

Abstract

Classification of imbalanced data is an important research problem because most of the data encountered in real-world systems is imbalanced. A representation learning technique called Synthetic Minority Over-sampling Technique (SMOTE) has been proposed to handle the imbalanced data problem. The Random Forest (RF) algorithm with SMOTE has previously been used to improve classification performance for the minority class over the majority class. Although RF with SMOTE demonstrates improved classification performance, the relationship between the classification performance and the imbalanced ratio between the majority and minority classes is not well defined. Mathematical models that describe this relationship are therefore useful, especially in the big data environment, which suffers from imbalanced data. In this paper, we propose a mathematical model using an empirical approach applied to the well-known Spambase dataset and the Random Forest classification approach, including its adoption with the SMOTE representation learning technique. We present a linear model that describes the relationship between the true positive classification rate and the imbalanced ratio between the majority and minority classes. This model can help IT researchers develop better spam filter algorithms.
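As a rough illustration of the oversampling idea named in the abstract, the sketch below generates synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours, which is the core step of SMOTE. It is a minimal standard-library sketch, not the authors' implementation: the function names `smote_sketch` and `imbalance_ratio`, the parameters `k` and `seed`, and the definition of the imbalanced ratio as majority count divided by minority count are all assumptions made for illustration.

```python
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points built from a list of minority samples.

    Hypothetical helper: each synthetic point lies on the line segment
    between a randomly chosen minority sample and one of its k nearest
    minority neighbours. Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to x.
        others = [p for p in minority if p is not x]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])
        u = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([a + u * (b - a) for a, b in zip(x, neighbour)])
    return synthetic


def imbalance_ratio(n_majority, n_minority):
    """Assumed definition: majority class size divided by minority class size."""
    return n_majority / n_minority


# Toy data only; these points are not taken from the Spambase dataset.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sketch(minority, n_new=6, k=2)
```

Adding the synthetic points to the minority class lowers the ratio computed by `imbalance_ratio`, which is the quantity the abstract relates to the true positive rate.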
