Contrasting Undersampled Boosting with Internal and External Feature Selection for Patient Response Datasets

Taghi M Khoshgoftaar, David J Dittman, Randall Wald, Amri Napolitano

doi:10.1109/icmla.2013.156

Abstract

Class imbalance (where one class has many more instances than the other class(es)) and high dimensionality (a large number of features per instance) are two problems frequently encountered in patient response datasets, which are also notoriously difficult to build effective models from. This paper introduces a new hybrid boosting algorithm, SelectRUSBoost, which combines data sampling and feature selection within every iteration of boosting. We compare SelectRUSBoost against RUSBoost combined with external feature selection on a set of five patient response datasets, using two classifiers, three filter-based feature selection techniques, and four feature subset sizes. Our results show that SelectRUSBoost, with few exceptions, outperforms RUSBoost combined with external feature selection. Moreover, the information gain feature selection technique outperformed the other techniques for all combinations of boosting approach, classifier, and feature subset size, and with this technique SelectRUSBoost outperformed RUSBoost combined with external feature selection without exception. Statistical analysis confirmed that SelectRUSBoost gives better performance than RUSBoost combined with external feature selection. This is the first work to use SelectRUSBoost in a bioinformatics study.
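To make the idea of sampling and selecting features inside every boosting iteration concrete, the following is a minimal sketch, not the implementation evaluated in the paper. All function names are hypothetical. It assumes binary labels in {0, 1}, uses a class-mean-difference ranker as a stand-in for the filter-based rankers (it is not information gain), uses a decision stump in place of the paper's classifiers, applies a simplified AdaBoost-style weight update, and assumes the order "undersample, then select features" within each round. The toy data is synthetic.

```python
import math
import random


def undersample(y, rng):
    """Indices of all minority instances plus an equal-sized random majority draw."""
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    return minority + rng.sample(majority, len(minority))


def rank_features(X, y, idx, k):
    """Top-k features by absolute difference of class means on the sampled rows.
    Assumption: a simple stand-in for the paper's filter-based rankers."""
    scores = []
    for j in range(len(X[0])):
        pos = [X[i][j] for i in idx if y[i] == 1]
        neg = [X[i][j] for i in idx if y[i] == 0]
        scores.append((abs(sum(pos) / len(pos) - sum(neg) / len(neg)), j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]


def fit_stump(X, y, w, idx, features):
    """Weighted decision stump over the selected features, fit on the sampled rows.
    Assumes at least one selected feature takes two distinct values."""
    best = None
    for j in features:
        values = sorted({X[i][j] for i in idx})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # candidate threshold: midpoint of neighbours
            for sign in (1, 0):  # predict `sign` above t, `1 - sign` otherwise
                err = sum(w[i] for i in idx
                          if (sign if X[i][j] > t else 1 - sign) != y[i])
                if best is None or err < best[0]:
                    best = (err, j, t, sign)
    return best[1:]


def stump_predict(stump, row):
    j, t, sign = stump
    return sign if row[j] > t else 1 - sign


def select_rus_boost(X, y, n_rounds=10, k=2, seed=0):
    """Boosting loop with undersampling and feature selection in every round."""
    rng = random.Random(seed)
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(n_rounds):
        idx = undersample(y, rng)              # 1. balance the classes
        features = rank_features(X, y, idx, k)  # 2. select features on the sample
        stump = fit_stump(X, y, w, idx, features)  # 3. train the weak learner
        # 4. weighted error on the full training set, clamped away from 0 and 1
        wrong = [stump_predict(stump, X[i]) != y[i] for i in range(n)]
        err = sum(w[i] for i in range(n) if wrong[i])
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        # 5. raise the weights of misclassified instances, then renormalise
        w = [w[i] * math.exp(alpha if wrong[i] else -alpha) for i in range(n)]
        total = sum(w)
        w = [wi / total for wi in w]
        ensemble.append((alpha, stump))
    return ensemble


def predict(ensemble, row):
    score = sum(a if stump_predict(s, row) == 1 else -a for a, s in ensemble)
    return 1 if score > 0 else 0


# Toy imbalanced data: 10 positives, 50 negatives; only feature 0 is informative.
data_rng = random.Random(1)
y = [1 if i < 10 else 0 for i in range(60)]
X = [[3.0 * label + data_rng.random()] + [data_rng.random() for _ in range(3)]
     for label in y]

ensemble = select_rus_boost(X, y, n_rounds=5, k=2)
train_accuracy = sum(predict(ensemble, r) == l for r, l in zip(X, y)) / len(y)
print(train_accuracy)
```

By contrast, the external approach described above would rank features once on the full training set and then run the boosting loop on the reduced feature set, so step 2 would move outside the loop.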

Full Text