Abstract

Most feature selection methods identify only a single solution. This is acceptable for predictive purposes, but is not sufficient for knowledge discovery if multiple solutions exist. We propose a strategy to extend a class of greedy methods to efficiently identify multiple solutions, and show under which conditions it identifies all solutions. We also introduce a taxonomy of features that takes the existence of multiple solutions into account. Furthermore, we explore different definitions of statistical equivalence of solutions, as well as methods for testing equivalence. A novel algorithm for compactly representing and visualizing multiple solutions is also introduced. In experiments we show that (a) the proposed algorithm is significantly more computationally efficient than the TIE* algorithm, the only alternative approach with similar theoretical guarantees, while identifying similar solutions to it, and (b) the identified solutions have similar predictive performance.

Highlights

  • Feature selection is an essential part of data analysis tasks which focus on knowledge discovery and improving understanding of the problem under study

  • While finding a single solution may be acceptable for building a predictive model, it is not sufficient when feature selection is employed for knowledge discovery

  • To reduce the chance of false positive equivalences, we recommend (a) extensively tuning the hyper-parameters of the feature selection algorithm to increase the chance of identifying Markov blankets, (b) first applying a permutation-based variance test for PEQ or MEQ to quickly filter out false equivalences, (c) then applying an IEQ test using the comprehensive approach to decide on equivalence, and (d) using relatively high significance levels to further reduce the number of false positives

Summary

Introduction

Feature selection is an essential part of data analysis tasks which focus on knowledge discovery and improving understanding of the problem under study. This is no accident, as the solution has been shown to be directly related to the data-generating causal mechanism (Koller and Sahami 1996; Tsamardinos and Aliferis 2003; Aliferis et al. 2010). The difference in predictive performance between two solutions may not be statistically distinguishable. In domains such as molecular biology there often exist multiple solutions, possibly because of the inherent redundancy present in the underlying biological system (Dougherty and Brun 2006; Statnikov and Aliferis 2010). A feature selection algorithm should identify all solutions that are “equivalent” (for some reasonable definition of equivalence). Another advantage of outputting multiple solutions is that one could use any of them for building a predictive model.
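To make the notion of statistically indistinguishable predictive performance concrete, the following is a minimal sketch of a generic paired sign-flip permutation test on per-sample losses from two models, each built on a different candidate feature set. It is not the variance test or the comprehensive approach discussed in the paper; the function name `paired_permutation_pvalue`, its parameters, and the example loss values are hypothetical and chosen only for illustration.

```python
import random


def paired_permutation_pvalue(losses_a, losses_b, n_permutations=2000, seed=0):
    """Two-sided sign-flip permutation test on paired per-sample losses.

    Null hypothesis (an assumption of this sketch): the paired loss
    differences are symmetric around zero, i.e. neither feature set
    predicts better than the other.
    """
    if len(losses_a) != len(losses_b):
        raise ValueError("loss vectors must have equal length")

    diffs = [a - b for a, b in zip(losses_a, losses_b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)

    rng = random.Random(seed)
    hits = 0
    for _ in range(n_permutations):
        # Under the null, the sign of each paired difference is arbitrary.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / n) >= observed - 1e-12:
            hits += 1

    # The +1 correction keeps the p-value strictly positive.
    return (hits + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Hypothetical per-sample losses for models built on two feature sets.
    model_a = [0.21, 0.35, 0.10, 0.50, 0.42, 0.18, 0.33, 0.27]
    model_b = [0.20, 0.37, 0.12, 0.48, 0.40, 0.19, 0.35, 0.25]
    p = paired_permutation_pvalue(model_a, model_b)
    print(f"p-value: {p:.3f}")
```

A large p-value means only that the data do not separate the two feature sets in predictive performance; under the paper's recommendations, a test of this kind would at most serve as a quick filter before a stricter equivalence test.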

Preliminaries
Conditional independence
The single and multiple feature selection problems
The JKP taxonomy of features
A taxonomy of features in the presence of multiple solutions
Statistically equivalent feature sets
Definitions of statistical equivalence of feature sets
Vuong’s variance test
The comprehensive approach
The J-test
Paired two-sample tests
Discovering multiple Markov blankets
Tests with feature sets that are not Markov blankets
Power of IEQ tests
Reliability of PEQ and MEQ tests
Summary
A general template for forward–backward algorithms
Extending TFBS for multiple solutions
A strategy to avoid repeating states
The TMFBS algorithm for multiple solutions
Theoretical properties of TMFBS
Sound rules for pruning the search space
Computational complexity
Summarizing and visualizing multiple solutions
Multiple solution graphs
An algorithm for constructing multiple solutions graphs
Compression operations
Algorithms for forward and backward compression
Related methods
Experimental evaluation
Evaluation of TMFBS and comparison with TIE*
Number of solutions and speed-up with increasing sample size
Conclusion
B: Additional results for the comparison of TMFBS with TIE*