Handling Imbalanced Datasets by Partially Guided Hybrid Sampling for Pattern Recognition

Tushar Sandhan, Jin Young Choi

doi:10.1109/icpr.2014.258

Abstract

High class imbalance in real-world domains is a direct result of the rarity of interesting events, which yields skewed datasets. Without dataset rebalancing, a learning algorithm encounters very few minority class samples and therefore becomes biased towards the majority class in classification tasks. Hence, properly handling imbalanced datasets is a crucial issue in the pattern recognition domain. We employ bootstrapping with simultaneous oversampling of the minority class and undersampling of the majority class to build an ensemble of classifiers. Oversampling is partially guided by hidden patterns extracted from the minority class, which prevents over-generalization and amplifies subtle but vital patterns. The proposed framework is evaluated on four highly imbalanced datasets using a series of classifiers such as support vector machine, logistic regression, nearest neighbor, and Gaussian process classifier. Experimental results show that pattern classification performance on various tasks improves after rebalancing the datasets with the proposed framework.
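To make the hybrid sampling idea concrete, the following is a minimal sketch of building an ensemble from balanced bootstrap sets, each formed by undersampling the majority class and oversampling the minority class. It is not the authors' method: the pattern-guided part of the oversampling is not reproduced here and is replaced by plain random oversampling with replacement, and the base learner is a simple nearest-centroid rule rather than any of the classifiers named above. All function names, parameters, and the toy data are hypothetical.

```python
import random
from collections import Counter

def make_toy_data(n_major=200, n_minor=10, seed=0):
    """Hypothetical 2-D toy data: majority near (0, 0), minority near (3, 3)."""
    rng = random.Random(seed)
    majority = [[rng.gauss(0, 0.5), rng.gauss(0, 0.5)] for _ in range(n_major)]
    minority = [[rng.gauss(3, 0.5), rng.gauss(3, 0.5)] for _ in range(n_minor)]
    return majority, minority

def hybrid_bootstrap(majority, minority, size, rng):
    """Draw one balanced set of `size` samples per class."""
    # Undersample the majority class (without replacement).
    maj = rng.sample(majority, min(size, len(majority)))
    # Oversample the minority class (with replacement).
    # Assumption: unguided random oversampling stands in for the guided step.
    mino = [rng.choice(minority) for _ in range(size)]
    return maj, mino

def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train_ensemble(majority, minority, n_models=11, size=None, seed=0):
    """Fit one nearest-centroid model per balanced bootstrap set."""
    rng = random.Random(seed)
    size = size or 2 * len(minority)
    models = []
    for _ in range(n_models):
        maj, mino = hybrid_bootstrap(majority, minority, size, rng)
        models.append((centroid(maj), centroid(mino)))
    return models

def predict(models, x):
    """Majority vote over the ensemble; 1 = minority class, 0 = majority class."""
    votes = Counter(
        1 if sq_dist(x, c_min) < sq_dist(x, c_maj) else 0
        for c_maj, c_min in models
    )
    return votes.most_common(1)[0][0]

majority, minority = make_toy_data()
models = train_ensemble(majority, minority)
```

Each ensemble member sees an equal number of samples from both classes, so no single model is dominated by the majority class, and using an odd number of members avoids tied votes. Swapping the nearest-centroid rule for a stronger base classifier would only change `train_ensemble` and `predict`.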

Full Text