Query By Committee Research Articles

Strategies for selecting informative data points for training prediction algorithms are important, particularly when data points are difficult and costly to obtain. A Query by Committee (QBC) training strategy uses the disagreement among a committee of different algorithms to suggest new data points that most rationally complement the existing data, that is, the most informative data points. To evaluate the QBC approach on a real-world problem, we compared strategies for selecting new data points. We trained neural network algorithms to predict the binding affinity of peptides to the MHC class I molecule HLA-A2. We show that the QBC strategy leads to higher performance than a baseline strategy in which new data points are selected at random from a pool of available data. Most peptides bind HLA-A2 with low affinity, and, as expected, a strategy of selecting peptides that are predicted to have high binding affinities also leads to more accurate predictors than the baseline strategy. The QBC value is shown to correlate with the measured binding affinity. This demonstrates that the different predictors can easily learn whether a peptide will fail to bind but often disagree when predicting whether a peptide binds. Using a carefully constructed computational setup, we demonstrate that selecting peptides with a high QBC value performs better than selecting peptides with a low QBC value, independently of binding affinity. When predictors are trained on a very limited set of data, they cannot be expected to disagree in a meaningful way, and we find a data limit below which the QBC strategy fails. Finally, data selection strategies similar to those used here might be of use in other settings in which generating more data is costly.
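The selection step described in this abstract can be illustrated with a minimal sketch. The abstract does not state how the QBC value is computed, so the sketch assumes it is the standard deviation of the committee's predictions for a candidate. The names `qbc_scores` and `select_queries` are hypothetical, and the committee members are plain callables standing in for the trained neural networks.

```python
from statistics import pstdev
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
Predictor = Callable[[T], float]


def qbc_scores(committee: Sequence[Predictor], pool: Sequence[T]) -> List[float]:
    """Disagreement per candidate, ASSUMED here to be the population
    standard deviation of the committee members' predictions."""
    return [pstdev([predict(x) for predict in committee]) for x in pool]


def select_queries(committee: Sequence[Predictor], pool: Sequence[T], k: int) -> List[T]:
    """Return the k pool candidates on which the committee disagrees most."""
    scores = qbc_scores(committee, pool)
    ranked = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)
    return [pool[i] for i in ranked[:k]]
```

In an active learning loop, the selected candidates would be measured, added to the training set, and the committee retrained before the next round. The random baseline mentioned in the abstract corresponds to replacing the ranking with a random draw from the pool.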

A long-standing goal in Machine Learning is to minimize sample complexity, i.e. to reduce as much as possible the number of examples used in the course of learning. The Active Learning paradigm aims at this goal by transforming the learner from a passive participant in the information-gathering process into an active one. Loosely speaking, the learner tries to minimize the number of labeled instances used in the course of learning, relying also on unlabeled instances to acquire the needed information whenever possible. The motivation comes from many real-life problems in which the teacher's activity is an expensive resource (e.g. text categorization, part-of-speech tagging). Query By Committee (QBC) (Seung et al., Query by committee, Proceedings of the Fifth Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA, 1992, pp. 287–294) is an Active Learning algorithm acting in the Bayesian model of concept learning (Haussler et al., Mach. Learning 14 (1994) 83), i.e. it assumes that the concept to be learned is chosen according to some fixed and known distribution. When applying the QBC algorithm to learn the class of linear separators, one faces the problem of implementing the mechanism for sampling hypotheses (the Gibbs oracle). The major problem is computational complexity, since the straightforward Monte Carlo method takes exponential time. In this paper we address the problems involved in implementing such a mechanism. We show how to convert them to questions about sampling from convex bodies or approximating the volume of such bodies. Similar problems have recently been solved in the field of computational geometry using random walks. These techniques enable us to devise efficient implementations of the QBC algorithm.
We also give a few improvements and corrections to the QBC algorithm, the most important of which is dropping the Bayes assumption when the concept class possesses a certain symmetry property (which holds for linear separators). We draw attention to a useful geometric lemma that bounds the maximal radius of a ball contained in a convex body. Finally, this paper exhibits a connection between random walks and certain Machine Learning notions such as ε-nets and support vector machines.
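To make the role of the Gibbs oracle concrete, here is a minimal sketch of QBC for linear separators through the origin. It uses the straightforward Monte Carlo approach that the abstract describes as taking exponential time (rejection sampling of hypotheses consistent with the labels seen so far), not the random-walk samplers the paper develops. The function names, the Gaussian prior over directions, and the `max_tries` cutoff are assumptions made for illustration.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]
Example = Tuple[Vector, int]


def sign_of(w: Vector, x: Vector) -> int:
    """Label assigned to x by the linear separator with normal vector w."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1


def sample_consistent(labeled: Sequence[Example], dim: int, rng: random.Random,
                      max_tries: int = 100_000) -> Optional[List[float]]:
    """Naive Gibbs oracle: draw directions from an isotropic Gaussian
    (an ASSUMED prior) and reject any that mislabels a known example.
    The acceptance rate shrinks with the version space, which is why
    this approach does not scale."""
    for _ in range(max_tries):
        w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        if all(sign_of(w, x) == y for x, y in labeled):
            return w
    return None


def qbc_stream(stream: Sequence[Vector], teacher: Callable[[Vector], int],
               dim: int, rng: random.Random) -> List[Example]:
    """Query the teacher only when two sampled hypotheses disagree on x."""
    labeled: List[Example] = []
    for x in stream:
        h1 = sample_consistent(labeled, dim, rng)
        h2 = sample_consistent(labeled, dim, rng)
        if h1 is None or h2 is None:
            break  # sampler gave up; the version space is too small for rejection sampling
        if sign_of(h1, x) != sign_of(h2, x):
            labeled.append((x, teacher(x)))
    return labeled
```

The set of weight vectors consistent with the labeled examples is an intersection of half-spaces, hence convex, which is what allows the rejection step to be replaced by sampling from a convex body.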

Articles published on Query By Committee

Variational Bayes for continuous hidden Markov models and its application to active learning

Selecting informative data for developing peptide-MHC binding predictors using a query by committee approach.

Query by committee, linear separation and random walks
