Abstract

One major problem faced when analyzing DNA microarrays is their high dimensionality (large number of features). Feature selection is therefore a necessary step when using these datasets. However, adding or removing instances can alter the subsets chosen by a feature selection technique. Ideally, a feature selection technique is robust (stable) to changes in the number of instances, so that the selected features change little even when instances are added or removed. In this study we test the stability of nineteen feature selection techniques across twenty-six datasets with varying levels of class imbalance. Our results show that the best choice of technique depends on the class balance of the datasets. The top performers are Deviance for balanced datasets, Signal-to-Noise for slightly imbalanced datasets, and AUC for imbalanced datasets. SVM-RFE was the least stable feature selection technique across the board; other poor performers include Gain Ratio, Gini Index, Probability Ratio, and Power. We also found that sufficiently large changes to the dataset can make any feature selection technique unstable, and that using more features increases the stability of most feature selection techniques. Most intriguing was our finding that the more imbalanced a dataset is, the more stable the feature subsets built for that dataset will be. Overall, we conclude that stability is an important aspect of feature ranking that must be taken into account when planning a feature selection strategy or when adding or removing instances from a dataset.
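To make the notion of stability concrete, the following is a minimal sketch of how the stability of one feature ranker under instance removal might be estimated. It is not the study's protocol: the abstract does not state the similarity measure or the sampling scheme, so the Jaccard overlap, the stratified subsampling, the synthetic data, and every function name below (`signal_to_noise`, `top_k`, `jaccard`, `stability`, `make_toy_data`) are assumptions made for illustration. The Signal-to-Noise score is taken here as |mean1 - mean0| / (std1 + std0), one common definition.

```python
import random
import statistics


def signal_to_noise(X, y):
    """Score each feature as |mean1 - mean0| / (std1 + std0) (assumed definition)."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row in pos]
        b = [row[j] for row in neg]
        spread = statistics.pstdev(a) + statistics.pstdev(b)
        gap = abs(statistics.fmean(a) - statistics.fmean(b))
        scores.append(gap / (spread + 1e-12))  # small constant avoids division by zero
    return scores


def top_k(scores, k):
    """Return the indices of the k highest-scoring features as a set."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return set(order[:k])


def jaccard(a, b):
    """Overlap between two feature subsets: |a & b| / |a | b|."""
    return len(a & b) / len(a | b)


def stability(X, y, k, keep_fraction, repeats, seed=0):
    """Mean Jaccard overlap between the full-data top-k subset and the
    top-k subsets chosen after removing instances (assumed protocol)."""
    rng = random.Random(seed)
    reference = top_k(signal_to_noise(X, y), k)
    by_class = {0: [], 1: []}
    for i, label in enumerate(y):
        by_class[label].append(i)
    overlaps = []
    for _ in range(repeats):
        kept = []
        # Subsample each class separately so the class ratio is preserved.
        for indices in by_class.values():
            size = max(1, round(keep_fraction * len(indices)))
            kept.extend(rng.sample(indices, size))
        kept.sort()
        X_sub = [X[i] for i in kept]
        y_sub = [y[i] for i in kept]
        subset = top_k(signal_to_noise(X_sub, y_sub), k)
        overlaps.append(jaccard(reference, subset))
    return statistics.fmean(overlaps)


def make_toy_data(n_pos, n_neg, n_features, n_informative, seed=0):
    """Synthetic stand-in for a microarray dataset (not real data)."""
    rng = random.Random(seed)
    X, y = [], []
    for label, count in ((1, n_pos), (0, n_neg)):
        for _ in range(count):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
            if label == 1:
                # Shift the first n_informative features for the positive class.
                for j in range(n_informative):
                    row[j] += 2.0
            X.append(row)
            y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data(n_pos=10, n_neg=30, n_features=40, n_informative=5)
    for fraction in (0.95, 0.8, 0.5):
        print(fraction, stability(X, y, k=5, keep_fraction=fraction, repeats=20))
```

Varying `keep_fraction`, `k`, and the ratio of `n_pos` to `n_neg` gives a rough way to probe, on toy data, the three factors the abstract discusses: degree of perturbation, subset size, and class imbalance. Any numbers it produces describe the synthetic data only.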
