Abstract

DNA microarrays are a modern advancement in the analysis of genetic data. This technology allows a researcher to test samples for thousands of genes simultaneously. However, once the samples have been tested, the researcher must search through the collected data to identify the genes important to their problem. One possible solution is the data mining pre-processing technique called feature selection. Feature (gene) selection takes the original set of features (in the case of DNA microarrays, gene probes) and chooses an optimal subset on which to perform analysis. Ideally, the reduced subset contains only the most important features as determined by the feature selection technique (or set of feature selection techniques), which allows further research to focus on the discovered genes. However, when multiple feature selection techniques are used, the set of techniques must be diverse in order to reduce redundancy among the chosen features. Another benefit of increasing diversity is that features chosen across a diverse set of feature selection techniques can be considered more important than those chosen by a single technique or by a set of related ones. It is therefore useful to know how similar the feature selection techniques are to each other. In this study, we analyze eighteen feature selection techniques across nine imbalanced DNA microarray datasets, using four feature subset sizes. Our results show that, to minimize redundancy, Gini Index and Probability Ratio should not be used together at any feature subset size, and neither should the Kolmogorov-Smirnov statistic and Geometric Mean. The members of the first of these pairs (along with the pair of ReliefF and ReliefF-W) are very dissimilar to all rankers outside their own cluster. We also found that Chi-Squared, Information Gain, and Symmetric Uncertainty form a cluster of similarity, as do Chi-Squared, Deviance, F-Measure, and Mutual Information.
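To make the idea of ranker similarity concrete, the following is a minimal sketch of one way two feature rankers could be compared: take each ranker's top-k features and measure how much the subsets overlap. The abstract does not state which similarity measure the study uses, so the Jaccard overlap here is an assumption, and the ranker names, scores, and function names are all hypothetical.

```python
from itertools import combinations


def top_k(scores, k):
    """Return the indices of the k highest-scoring features as a set."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def pairwise_similarity(ranker_scores, k):
    """Compare every pair of rankers by the overlap of their top-k subsets."""
    subsets = {name: top_k(s, k) for name, s in ranker_scores.items()}
    return {
        (r1, r2): jaccard(subsets[r1], subsets[r2])
        for r1, r2 in combinations(sorted(subsets), 2)
    }


# Made-up scores for six features from three hypothetical rankers
# (illustrative only, not data from the study).
ranker_scores = {
    "ranker_a": [0.9, 0.8, 0.1, 0.7, 0.2, 0.3],
    "ranker_b": [0.8, 0.9, 0.2, 0.6, 0.1, 0.3],
    "ranker_c": [0.1, 0.2, 0.9, 0.3, 0.8, 0.7],
}

similarity = pairwise_similarity(ranker_scores, k=3)
for pair, value in similarity.items():
    print(pair, round(value, 2))
```

In this toy input, ranker_a and ranker_b select the same three features, so pairing them would add redundancy, while ranker_c selects a disjoint subset and would add diversity. Repeating the comparison for several values of k would correspond to examining more than one feature subset size.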
