Hybrid Data-Level Techniques for Class Imbalance Problem

Anjana Gosain, Arushi Gupta, Deepika Singh

doi:10.1007/978-981-15-5113-0_95

Abstract

In data mining, the task of classification is to assign an instance in a dataset to one of a set of predefined classes. In real-life applications, traditional classifiers do not work well on imbalanced datasets, i.e., datasets in which one class, called the minority class, contains very few data points compared with the other class(es), called the majority class(es). This skewed class distribution is termed the class imbalance problem (CIP). Researchers have examined the effects of CIP on classifier performance and proposed various techniques to handle it. In the literature, these techniques are broadly classified into three levels: data-level approaches (or pre-processing techniques), algorithm-level approaches and ensemble-level approaches. The data-level (sampling-based) approaches are further subdivided into three categories: oversampling techniques, undersampling techniques and hybrid sampling (undersampling + oversampling) techniques. In this paper, we propose three hybrid sampling techniques (named Bor-SMOTE+TL, TL+C-SMOTE and SL-SMOTE+TL) that combine Tomek links, an undersampling technique, with oversampling techniques. Experiments are carried out on real-life imbalanced datasets to show the usefulness of the proposed techniques compared with existing sampling techniques.
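
As a rough illustration of the undersampling component named above, the following minimal sketch finds Tomek links, i.e., pairs of instances from different classes that are each other's nearest neighbour, and removes the majority-class member of each pair. It is not the authors' implementation: the function names `tomek_links` and `remove_majority_in_links`, the Euclidean distance, and the choice to drop only the majority-class instance are assumptions made for this example.

```python
import math

def tomek_links(X, y):
    """Return index pairs (i, j), i < j, that form Tomek links.

    Assumption: Euclidean distance; ties go to the lowest index.
    """
    def nearest(i):
        # Index of the closest other point to X[i].
        return min((k for k in range(len(X)) if k != i),
                   key=lambda k: math.dist(X[i], X[k]))

    nn = [nearest(i) for i in range(len(X))]
    # A Tomek link is a mutual nearest-neighbour pair with different labels.
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]

def remove_majority_in_links(X, y, majority_label):
    """Drop the majority-class member of every Tomek link (assumed policy)."""
    drop = {k for pair in tomek_links(X, y)
            for k in pair if y[k] == majority_label}
    keep = [k for k in range(len(X)) if k not in drop]
    return [X[k] for k in keep], [y[k] for k in keep]

# Toy data (hypothetical): label 0 is the majority class, 1 the minority.
X = [(0, 0), (0, 1), (1, 0), (5, 5), (5.2, 5), (9, 9)]
y = [0, 0, 0, 0, 1, 1]
links = tomek_links(X, y)
X_clean, y_clean = remove_majority_in_links(X, y, majority_label=0)
```

In a hybrid scheme of the kind the abstract describes, a cleaning step like this would be applied before or after an oversampling step; the ordering and the oversampler are specific to each proposed technique and are not modelled here.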

Full Text