Abstract

High dimensionality is one of the major problems in data mining, arising when a dataset contains a large number of attributes. One common technique for alleviating high dimensionality is feature selection, the process of selecting the most relevant attributes and removing irrelevant and redundant ones. Much research has evaluated the performance of classifiers before and after feature selection, but little work has examined how sensitive the selected feature subsets are to changes (additions/deletions) in the dataset. In this study we evaluate the robustness of six commonly used feature selection techniques, investigating the impact of data sampling and class noise on the stability of feature selection. All experiments are carried out with six commonly used feature rankers on four groups of datasets from the biology domain. We employ three sampling techniques and generate artificial class noise to better simulate real-world datasets. The results demonstrate that although no ranker consistently outperforms the others, Gain Ratio shows the least stability on average. Additional tests using our feature rankers to build classification models also show that a feature ranker's stability is not an indicator of its classification performance.

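As a rough illustration of the kind of stability evaluation the abstract describes, the sketch below (Python, not taken from the paper) scores features with a simple correlation-based stand-in for the paper's rankers, perturbs the data by subsampling and flipping a fraction of class labels, and reports stability as the average pairwise Jaccard similarity between the top-k feature subsets. The scorer, the parameter values, and the use of the Jaccard measure are illustrative assumptions, not the study's actual protocol.

```python
import numpy as np

def rank_features(X, y):
    """Score each feature by absolute correlation with the class label.
    (Stand-in scorer; the paper uses rankers such as Gain Ratio.)"""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12
    return np.abs(Xc.T @ yc) / denom

def top_k(scores, k):
    """Indices of the k highest-scoring features."""
    return set(np.argsort(scores)[::-1][:k])

def jaccard(a, b):
    """Overlap between two feature subsets."""
    return len(a & b) / len(a | b)

def stability(X, y, k=10, runs=20, sample_frac=0.8, noise_rate=0.1, seed=0):
    """Average pairwise Jaccard similarity of top-k subsets selected on
    perturbed copies of the data (subsampling plus random class-label noise)."""
    rng = np.random.default_rng(seed)
    subsets = []
    n = len(y)
    for _ in range(runs):
        idx = rng.choice(n, size=int(sample_frac * n), replace=False)
        ys = y[idx].copy()
        flip = rng.random(len(ys)) < noise_rate   # inject artificial class noise
        ys[flip] = 1 - ys[flip]                   # assumes binary 0/1 labels
        subsets.append(top_k(rank_features(X[idx], ys), k))
    sims = [jaccard(a, b) for i, a in enumerate(subsets) for b in subsets[i + 1:]]
    return float(np.mean(sims))

if __name__ == "__main__":
    # Synthetic data: only the first 5 of 50 features carry class information.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 50))
    y = (X[:, :5].sum(axis=1) > 0).astype(int)
    print("stability:", round(stability(X, y), 3))
```

A value near 1 would indicate that the ranker selects nearly the same top-k subset regardless of the perturbation, while values near 0 would indicate high sensitivity to sampling and class noise.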