Abstract

High dimensionality (too many features) is found across many data science domains. Feature selection techniques address this problem by choosing a subset of features which are more relevant to the problem at hand. These techniques can simply rank the features, but this risks including multiple features which are individually useful but which contain redundant information. Subset evaluation techniques, on the other hand, consider the usefulness of whole subsets, and therefore avoid selecting redundant features. Subset-based techniques can either be filters, which apply some statistical test to the subsets to measure their worth, or wrappers, which judge features based on how effective they are when building a model. One known problem with subset-based techniques is stability: because redundant features are not included, slight changes to the input data can have a significant effect on which features are chosen. In this study, we explore the stability of feature subset selection, including two filter-based techniques and five choices for both the wrapper learner and the wrapper performance metric. We also introduce a new stability metric, the modified Kuncheva's consistency index, which is able to compare two feature subsets of different sizes. We consider both the stability of the feature selection technique and the average and standard deviation of feature subset size. Our results show that the Consistency feature subset evaluator has the greatest stability overall, but CFS (Correlation-Based Feature Selection) shows moderate stability with a much smaller standard deviation of feature subset size. All of the wrapper-based techniques are less stable than the filter-based techniques, although the Naïve Bayes learner using the AUC performance metric is the most stable wrapper-based approach.
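For readers unfamiliar with the metric being modified, the sketch below illustrates Kuncheva's original consistency index, which assumes two feature subsets of equal size, together with one plausible way to score subsets of different sizes by comparing the observed overlap with the overlap expected by chance. The function names and the unequal-size variant are illustrative assumptions only; the exact modified index is the one defined in the paper itself.

```python
def kuncheva_index(a, b, n):
    """Kuncheva's original consistency index for two equal-size feature
    subsets a and b drawn from n total features (1.0 for identical sets)."""
    a, b = set(a), set(b)
    k = len(a)
    assert len(b) == k, "the original index requires equal subset sizes"
    r = len(a & b)  # observed overlap
    return (r * n - k * k) / (k * (n - k))


def unequal_size_consistency(a, b, n):
    """Illustrative generalisation to subsets of different sizes: the
    observed overlap minus the overlap expected by chance, scaled by the
    attainable range of the overlap. Positive when the two subsets agree
    more than chance, negative when they agree less. This is an assumed
    variant, not necessarily the paper's modified Kuncheva index."""
    a, b = set(a), set(b)
    r = len(a & b)
    expected = len(a) * len(b) / n            # overlap expected by chance
    max_r = min(len(a), len(b))               # largest possible overlap
    min_r = max(0, len(a) + len(b) - n)       # smallest possible overlap
    return (r - expected) / (max_r - min_r) if max_r != min_r else 0.0


# Example: two subsets of different sizes chosen from 20 features.
print(unequal_size_consistency({0, 1, 2, 3}, {1, 2, 3, 4, 5, 6}, 20))
```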
