Comparative Analysis on the Stability of Feature Selection Techniques Using Three Frameworks on Biological Datasets

Amri Napolitano, Ahmad Abu Shanab, Taghi M Khoshgoftaar, Randall Wald

doi:10.1109/icmla.2013.85

Abstract

Feature (gene) selection is a common preprocessing technique used to counter the problem of high dimensionality (too many independent features) found in many bioinformatics datasets, addressing this problem by creating a smaller feature subset including only the most important features. Although feature selection techniques are often evaluated based on how they can help improve classification performance, it is also important to find stable feature selection techniques which will give consistent results even in the face of dataset perturbations (such as class noise or sampling used to alleviate the problem of imbalanced data). This is especially important in bioinformatics, where the prime concern may be gene discovery rather than classification. In this study we use three frameworks to evaluate the stability of gene selection techniques: "sampled-clean vs. sampled-clean," "sampled-noisy vs. sampled-noisy," and "sampled-clean vs. sampled-noisy." All frameworks involve pairwise comparisons among the results from the perturbed datasets (due to sampling or class noise injection followed by sampling). They differ in terms of whether they observe how sampling can create variation within the feature subsets (sampled-clean vs. sampled-clean), how noisy datasets (which were then sampled) can create a wide spread of selected features (sampled-noisy vs. sampled-noisy), or how features selected on clean and noisy datasets differ, after both datasets have been sampled (sampled-clean vs. sampled-noisy). Along with these three frameworks, our comparison of seven feature ranking techniques uses four cancer gene datasets, applies three sampling techniques, and generates artificial class noise to better simulate real-world datasets.
The results from the frameworks are generally similar, with Signal-To-Noise and ReliefF showing the best stability and Gain Ratio showing the worst across all three frameworks. Relief-W is a notable exception: it shows moderate to above-average stability when the clean datasets are used, but gives the second-worst performance when noise is present.
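
The pairwise comparisons behind the three frameworks can be illustrated with a minimal Python sketch. The abstract does not name the similarity measure used, so Jaccard similarity stands in here purely as an assumption; the function names and the toy gene subsets are hypothetical and are not taken from the study.

```python
from itertools import combinations, product


def jaccard(a, b):
    """Jaccard similarity of two feature subsets (an assumed stand-in metric)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def within_group_stability(subsets):
    """Mean similarity over all pairs of subsets drawn from the same group.

    Covers the "sampled-clean vs. sampled-clean" and
    "sampled-noisy vs. sampled-noisy" style of comparison.
    """
    pairs = list(combinations(subsets, 2))
    if not pairs:
        raise ValueError("need at least two feature subsets")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def cross_group_stability(group_a, group_b):
    """Mean similarity over all pairs with one subset from each group.

    Covers the "sampled-clean vs. sampled-noisy" style of comparison.
    """
    pairs = list(product(group_a, group_b))
    if not pairs:
        raise ValueError("both groups must be non-empty")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical feature subsets chosen by one ranker on repeated samples.
clean_runs = [{"g1", "g2", "g3"}, {"g1", "g2", "g4"}, {"g1", "g2", "g3"}]
noisy_runs = [{"g1", "g5", "g6"}, {"g2", "g5", "g7"}, {"g1", "g2", "g5"}]

clean_vs_clean = within_group_stability(clean_runs)
noisy_vs_noisy = within_group_stability(noisy_runs)
clean_vs_noisy = cross_group_stability(clean_runs, noisy_runs)
```

In this toy setup, a ranker whose within-group score drops sharply from the clean runs to the noisy runs would be behaving like the noise-sensitive technique described above, while a ranker with similar scores in all three comparisons would count as stable.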

Full Text