Abstract

Identifying important biomarkers to improve disease diagnosis and treatment is a significant research topic in bioinformatics. However, bioinformatics datasets frequently have a large number of features per sample or instance. This problem, known as “high dimensionality,” can be alleviated through dimension-reduction techniques such as feature (gene) selection, which remove unnecessary features. Many feature selection techniques exist, with varying biases and predictive abilities. Predictive power, however, is only one factor to consider when choosing a feature selection technique: one must also consider the technique's stability, that is, its ability to produce feature subsets that remain valid in the face of changes to the data. While there has been work on determining the relative stability of different feature selection techniques, this does not always indicate whether a chosen technique will give stable feature subsets for a specific dataset. Factors such as difficulty of learning (i.e., dataset difficulty) may also influence feature selection stability, so that generally true observations about different techniques may not apply to a given dataset. In this work, we study how dataset difficulty can affect the stability of feature selection techniques, leading to good performance from otherwise poor techniques and vice versa. We use a set of twenty-six DNA microarray datasets with varying levels of difficulty of learning, along with four levels of dataset perturbation, six feature selection techniques with various levels of stability, and twelve feature subset sizes. The results show that as dataset difficulty increases, stability decreases; however, the relative stability of the techniques remains the same. Additionally, the more difficult the dataset, the more its stability is affected by changes to the data. We also found that unstable rankers are more affected by the transition between Easy and Moderate datasets, whereas stable techniques are more affected by the transition between Moderate and Hard datasets. Lastly, as the feature subset size increases, stability increases and the difference between the levels of dataset difficulty decreases. Overall, we conclude that difficulty of learning must be taken into account before interpreting stability results.
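Since the abstract hinges on the notion of feature selection stability under data perturbation, a minimal sketch may help make the idea concrete. The Python code below is an illustrative assumption, not the paper's actual protocol: the univariate scoring function, the subsampling scheme used as the perturbation, and all parameter values are hypothetical. It measures stability as the average pairwise Jaccard similarity between the top-k feature subsets selected from randomly subsampled copies of a dataset.

```python
import numpy as np
from itertools import combinations

def select_top_k(X, y, k):
    """Score each feature by the absolute two-class mean difference
    scaled by its standard deviation (a simple univariate ranker),
    and return the indices of the top-k features."""
    mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    sd = X.std(axis=0) + 1e-12  # avoid division by zero
    scores = np.abs(mu0 - mu1) / sd
    return set(np.argsort(scores)[-k:])

def stability(X, y, k=10, n_runs=20, frac=0.9, seed=0):
    """Average pairwise Jaccard similarity of the top-k feature subsets
    selected from randomly subsampled (perturbed) copies of the data.
    1.0 means the same subset is always chosen; values near 0 mean the
    selected features change drastically with small data changes."""
    rng = np.random.default_rng(seed)
    n = len(y)
    subsets = []
    for _ in range(n_runs):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        subsets.append(select_top_k(X[idx], y[idx], k))
    sims = [len(a & b) / len(a | b) for a, b in combinations(subsets, 2)]
    return float(np.mean(sims))

# Tiny synthetic example: 100 samples, 500 features, 5 informative ones.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=100)
X = rng.normal(size=(100, 500))
X[:, :5] += y[:, None] * 2.0  # shift the informative features by class
print(f"stability (top-10): {stability(X, y):.3f}")
```

Other similarity measures, such as Kuncheva's consistency index, correct for the overlap expected by chance at a given subset size; the Jaccard index is used here only for simplicity.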
