Abstract

What is the simplest thing you can do to solve a problem? In the context of semi-supervised feature selection, we tackle exactly this—how much we can gain from two simple classifier-independent strategies. If we have some binary labelled data and some unlabelled, we could assume the unlabelled data are all positives, or assume them all negatives. These minimalist, seemingly naive, approaches have not previously been studied in depth. However, with theoretical and empirical studies, we show they provide powerful results for feature selection, via hypothesis testing and feature ranking. Combining them with some “soft” prior knowledge of the domain, we derive two novel algorithms (Semi-JMI, Semi-IAMB) that outperform significantly more complex competing methods, showing particularly good performance when the labels are missing-not-at-random. We conclude that simple approaches to this problem can work surprisingly well, and in many situations we can provably recover the exact feature selection dynamics, as if we had labelled the entire dataset.
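The following is a minimal, illustrative sketch (not the paper's implementation) of the simpler of the two strategies described above: build the surrogate label Y1 by treating every unlabelled example as positive, then rank discrete features by their estimated mutual information with Y1. All names, data, and parameters below are hypothetical.

    import numpy as np

    def mutual_information(x, y):
        """Plug-in estimate of I(X;Y) in nats for two discrete vectors."""
        xs, x_idx = np.unique(x, return_inverse=True)
        ys, y_idx = np.unique(y, return_inverse=True)
        joint = np.zeros((len(xs), len(ys)))
        np.add.at(joint, (x_idx, y_idx), 1)       # contingency counts
        joint /= joint.sum()                      # joint distribution p(x, y)
        px = joint.sum(axis=1, keepdims=True)     # marginal p(x)
        py = joint.sum(axis=0, keepdims=True)     # marginal p(y)
        nz = joint > 0
        return float((joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum())

    rng = np.random.default_rng(0)
    n, d = 5000, 5
    y = rng.integers(0, 2, size=n)                       # true labels, mostly hidden below
    X = rng.integers(0, 2, size=(n, d))                  # binary features
    X[:, 0] = np.where(rng.random(n) < 0.8, y, X[:, 0])  # only feature 0 is relevant

    labelled = rng.random(n) < 0.1                       # 10% of labels observed
    y1 = np.where(labelled, y, 1)                        # surrogate: unlabelled -> positive

    scores = [mutual_information(X[:, k], y1) for k in range(d)]
    print("ranking by I(X_k; Y1):", np.argsort(scores)[::-1])  # feature 0 should rank first

The symmetric strategy builds Y0 by treating every unlabelled example as negative; the paper combines the two, together with "soft" prior knowledge of the domain, to obtain Semi-JMI and Semi-IAMB.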

Highlights

  • Many real-world applications have limited access to labelled data, but abundant access to large amounts of unlabelled data

  • The first contribution asks what happens to the false positive rate (FPR) and the false negative rate (FNR) if we test with the surrogate variables Y0 or Y1, i.e. using the statistics G(X; Y0) or G(X; Y1) instead of the ideal G(X; Y); the answer is proven in Sect. 3 (a minimal sketch of such a surrogate test appears after this list)

  • In this context, discovering the Markov Blanket (MB) can be useful for eliminating irrelevant features or features that are redundant in the context of others, and as a result it plays a fundamental role in filter feature selection
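To make the highlighted question concrete, here is a minimal sketch (illustrative code, not the authors') of testing independence between a discrete feature X and either the fully observed labels Y or the surrogates Y0/Y1, obtained by treating every unlabelled example as negative or positive respectively. The G-statistic is computed via scipy's log-likelihood-ratio option; the helper names and toy data are assumptions.

    import numpy as np
    from scipy.stats import chi2_contingency

    rng = np.random.default_rng(0)

    # Toy data: a binary feature X that agrees with the binary class Y 70% of the time.
    n = 2000
    y = rng.integers(0, 2, size=n)
    x = np.where(rng.random(n) < 0.7, y, 1 - y)

    # Hide 80% of the labels completely at random; s=True means "label observed".
    s = rng.random(n) < 0.2

    def make_surrogate(y, s, fill_value):
        """Replace every unobserved label with a constant: 0 gives Y0, 1 gives Y1."""
        return np.where(s, y, fill_value)

    def g_test(x, y):
        """G-test of independence between two binary vectors (log-likelihood ratio)."""
        table = np.zeros((2, 2))
        np.add.at(table, (x, y), 1)
        g, p, dof, _ = chi2_contingency(table, correction=False, lambda_="log-likelihood")
        return g, p

    for name, labels in [("ideal     G(X;Y) ", y),
                         ("surrogate G(X;Y0)", make_surrogate(y, s, 0)),
                         ("surrogate G(X;Y1)", make_surrogate(y, s, 1))]:
        g, p = g_test(x, labels)
        print(f"{name}: G = {g:8.1f}, p = {p:.3g}")

How the FPR and FNR of such surrogate tests relate to those of the ideal test is exactly what Sect. 3 of the paper characterises.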

Summary

Introduction

Many real-world applications have limited access to labelled data but abundant access to unlabelled data. We tackle two semi-supervised scenarios: one where the labels are missing completely at random (MCAR), and a missing-not-at-random scenario (MAR-C) in which the class labels are missing according to a mechanism that depends on the class label itself (Moreno-Torres et al. 2012). The latter can occur, for example, when a social stigma is associated with reporting a label, such as income levels or HIV incidence. We exploit the properties of these missingness mechanisms to derive novel feature selection algorithms, which turn out to be highly competitive with significantly more complex procedures.
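As a hypothetical illustration (not from the paper) of the difference between the two mechanisms: under MCAR the labelled subsample preserves the class prior, while under MAR-C, where the chance of observing a label depends on the class itself, the labelled subsample exhibits a shifted prior. All probabilities below are made up.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 10_000
    y = rng.integers(0, 2, size=n)                  # true binary class labels

    # MCAR: every label is observed with the same probability, regardless of y.
    s_mcar = rng.random(n) < 0.3

    # MAR-C: the observation probability depends on the class label itself,
    # e.g. a stigmatised positive label is reported far less often.
    p_observe_given_y = np.array([0.4, 0.1])        # P(observed | y=0), P(observed | y=1)
    s_marc = rng.random(n) < p_observe_given_y[y]

    for name, s in [("MCAR ", s_mcar), ("MAR-C", s_marc)]:
        print(f"{name}: {s.mean():.0%} labelled, "
              f"P(y=1 | labelled) = {y[s].mean():.2f}, true prior = {y.mean():.2f}")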

Summary of results
Background
Feature selection by testing independence—Markov Blanket discovery
Phase II: backward — shrinkage
Feature selection by ranking—information theoretic methods
Semi-supervised learning
Motivating an inference-free approach and related work
Surrogate approaches for hypothesis testing
Conditional independence tests in semi-supervised learning
The switching procedure applied to Markov Blanket discovery—Semi-IAMB
Surrogate approaches for feature ranking
Step 4
Extending to higher order criteria
Application 1
MB discovery in positive-unlabelled learning
Incorporating “exact” prior knowledge in sample size determination
Evaluation of MB discovery in PU data
MB discovery in semi-supervised learning under class-prior-change
Comparing information theoretic feature selection approaches
Exploring the consistency of the selected subsets
Exploring the misclassification error
Comparison with state-of-the-art semi-supervised feature selection methods
Summary of contributions
Future work
A Tutorial on information theoretic testing and estimation
Theorem 1
Theorem 2
Theorem 3
Theorem 4
Theorem 6
Theorem 7
