PAKDD’12 best paper: generating balanced classifier-independent training samples from unlabeled data

Youngja Park, Ian M Molloy, Zijie Qi, Suresh N Chari

doi:10.1007/s10115-013-0683-1

Abstract

We consider the problem of generating balanced training samples from an unlabeled data set with an unknown class distribution. While random sampling works well when the data are balanced, it is ineffective for unbalanced data, because minority-class instances are rarely drawn. Other approaches, such as active learning and cost-sensitive learning, are also suboptimal: they are classifier-dependent, and they require labeled samples and misclassification costs, respectively. We propose a new strategy for generating training samples that is independent of both the underlying class distribution of the data and the classifier that will be trained on the labeled data. Our methods are iterative and can be seen as variants of active learning in which semi-supervised clustering is used at each iteration to perform biased sampling from the clusters. We provide several strategies to estimate the underlying class distributions in the clusters and to improve the balance of the training samples. Experiments with both highly skewed and balanced data from the UCI repository and a private data set show that our algorithm produces much more balanced samples than random sampling or uncertainty sampling. Further, our sampling strategy is substantially more efficient than active learning methods. The experiments also validate that, with more balanced training data, classifiers trained with our samples outperform classifiers trained with random sampling or active learning.
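The iterative loop described in the abstract (cluster, estimate per-cluster class distributions, sample with a bias toward under-represented classes) can be illustrated with a minimal sketch. This is not the authors' algorithm: plain 1-D k-means stands in for the semi-supervised clustering step, the Laplace-smoothed estimate is only one possible way to guess a cluster's class distribution, and the names `kmeans_1d`, `balanced_sample`, `oracle`, `classes`, `budget`, and `batch` are all hypothetical.

```python
import random
from collections import Counter


def kmeans_1d(xs, k, rng, iters=10):
    """Plain 1-D k-means; returns a cluster index for every point."""
    centers = rng.sample(xs, k)
    assign = [0] * len(xs)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: abs(x - centers[c])) for x in xs]
        for c in range(k):
            members = [x for x, a in zip(xs, assign) if a == c]
            if members:
                centers[c] = sum(members) / len(members)
    return assign


def balanced_sample(xs, oracle, budget, classes=(0, 1), k=4, batch=5, seed=0):
    """Label `budget` points, favoring clusters that look rich in the rarest class.

    `oracle(i)` returns the true label of point i (a stand-in for a human
    annotator). `classes` is assumed to be known in advance.
    """
    rng = random.Random(seed)
    budget = min(budget, len(xs))
    labeled = {}  # point index -> label

    # Bootstrap with a small uniform random batch.
    for i in rng.sample(range(len(xs)), min(batch, budget)):
        labeled[i] = oracle(i)

    while len(labeled) < budget:
        # Assumption: a semi-supervised method would use `labeled` here;
        # plain k-means ignores it, so only the random initialization changes.
        assign = kmeans_1d(xs, k, rng)

        # The class with the fewest labels so far is the one to hunt for.
        counts = Counter(labeled.values())
        rare = min(classes, key=lambda y: counts[y])

        # Estimate each cluster's share of the rare class (Laplace smoothing),
        # so clusters without any labels still get a moderate weight.
        est, pools = {}, {}
        for c in range(k):
            members = [i for i, a in enumerate(assign) if a == c]
            seen = [labeled[i] for i in members if i in labeled]
            pools[c] = [i for i in members if i not in labeled]
            est[c] = (seen.count(rare) + 1) / (len(seen) + 2)

        # Biased sampling: pick a cluster in proportion to its estimate,
        # then label one random unlabeled point from that cluster.
        for _ in range(min(batch, budget - len(labeled))):
            live = [c for c in range(k) if pools[c]]
            c = rng.choices(live, weights=[est[c] for c in live])[0]
            i = pools[c].pop(rng.randrange(len(pools[c])))
            labeled[i] = oracle(i)

    return labeled


# Synthetic skewed data: 200 majority points near 0, 20 minority points near 10.
xs = [i / 200 for i in range(200)] + [10 + i / 20 for i in range(20)]
ys = [0] * 200 + [1] * 20
sample = balanced_sample(xs, lambda i: ys[i], budget=40)
print(Counter(sample.values()))
```

On skewed data like this, the printed label counts would typically be closer to balanced than a uniform random draw of the same size, though the sketch gives no guarantee of that. Because the cluster weights depend only on the labels collected so far, no classifier is involved in choosing which points to label, which is the sense in which such a sampler is classifier-independent.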

Full Text