Abstract

In this paper we focus on constructing binary classifiers that are built on the premise of minimising an upper bound on their future misclassification rate. We pay particular attention to the approach taken by the minimax probability machine (Lanckriet et al. in J Mach Learn Res 3:555–582, 2003), which directly minimises an upper bound on the future misclassification rate in a worst-case setting: that is, under all possible choices of class-conditional distributions with a given mean and covariance matrix. The validity of these bounds rests on the assumption that the means and covariance matrices are known in advance; however, this is not always the case in practice, and their empirical counterparts have to be used instead. This can result in erroneous upper bounds on the future misclassification rate and lead to the formulation of sub-optimal predictors. In this paper we address this oversight and study the influence that uncertainty in the moments, the mean and covariance matrix, has on the construction of predictors under the minimax principle. By using high-probability upper bounds on the deviation between the true moments and their empirical counterparts, we can re-formulate the minimax optimisation to incorporate this uncertainty and find the predictor that minimises the high-probability, worst-case misclassification rate. The moment uncertainty introduces a natural regularisation component into the optimisation, where each class is regularised in proportion to the degree of moment uncertainty. Experimental results support the view that, when data availability is limited, incorporating moment uncertainty can lead to the formation of better predictors.
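
To make the worst-case formulation concrete, the sketch below sets up the original MPM of Lanckriet et al. (2003) on empirical moments: minimise ||Sigma_1^{1/2} w|| + ||Sigma_2^{1/2} w|| subject to w^T(mu_1 - mu_2) = 1, which yields the worst-case misclassification bound 1/(1 + kappa^2). The ridge terms rho_pos and rho_neg are a hypothetical stand-in for the per-class moment-uncertainty regularisation described above, not the paper's exact formulation, and the choice of CVXPY as solver is an assumption of this sketch.

import numpy as np
import cvxpy as cp

def mpm_fit(X_pos, X_neg, rho_pos=0.0, rho_neg=0.0):
    """Minimax probability machine on empirical moments.

    rho_pos / rho_neg inflate each class's covariance estimate and stand in,
    illustratively, for the moment-uncertainty regularisation discussed in
    the abstract (hypothetical parameterisation, not the paper's exact form).
    """
    d = X_pos.shape[1]
    mu1, mu2 = X_pos.mean(axis=0), X_neg.mean(axis=0)
    S1 = np.cov(X_pos, rowvar=False) + rho_pos * np.eye(d)
    S2 = np.cov(X_neg, rowvar=False) + rho_neg * np.eye(d)

    # Lanckriet et al. (2003): minimise ||S1^{1/2} w|| + ||S2^{1/2} w||
    # subject to w^T (mu1 - mu2) = 1 (a second-order cone program).
    L1 = np.linalg.cholesky(S1 + 1e-10 * np.eye(d))
    L2 = np.linalg.cholesky(S2 + 1e-10 * np.eye(d))
    w = cp.Variable(d)
    problem = cp.Problem(cp.Minimize(cp.norm(L1.T @ w) + cp.norm(L2.T @ w)),
                         [w @ (mu1 - mu2) == 1])
    problem.solve()

    w_opt = w.value
    s1 = np.sqrt(w_opt @ S1 @ w_opt)
    s2 = np.sqrt(w_opt @ S2 @ w_opt)
    kappa = 1.0 / (s1 + s2)
    b = w_opt @ mu1 - kappa * s1   # threshold: predict class 1 if w^T x >= b
    worst_case_error = 1.0 / (1.0 + kappa ** 2)
    return w_opt, b, worst_case_error

Inflating each class's covariance separately mirrors the per-class regularisation described above: the less certain the moments of a class, the larger its ridge term and the more conservative its contribution to the bound.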

Highlights

  • In this paper we examine the problem of constructing classifiers that are built to minimise upper bounds on the future misclassification rate of a predictor

  • We examine the performance of the proposed HP-minimax probability machine (MPM) and compare it to the original MPM, and two other popular binary classification algorithms, Fisher’s discriminant (FDA) (Fisher 1936), and the support vector machine (SVM)

  • In this paper we addressed an oversight of the original minimax probability machine (Lanckriet et al. 2003): that is, the worst-case future misclassification rates depend on prior knowledge of each class's mean and covariance matrix

Introduction

In this paper we examine the problem of constructing classifiers that are built to minimise upper bounds on the future misclassification rate of a predictor. Statistical learning theory provides a way of estimating the future misclassification rate of a predictor based on its empirical performance and some measure of the complexity of the predictor function, e.g. the Vapnik-Chervonenkis dimension (Vapnik and Chervonenkis 1971) or the fat-shattering dimension (Alon et al. 1997). Further work in this direction (Marchand and Shawe-Taylor 2002; Sokolova et al. 2002) has been based on the prior assumption that the decision boundary can be constructed as a logical combination of a small set of data-derived features. The approach described in this paper is instead inspired by class-conditional error bounds; a worked form of such a bound is sketched below.
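
For context, the class-conditional worst-case bound underlying the MPM (Lanckriet et al. 2003, via the multivariate Chebyshev inequality of Marshall and Olkin) can be written as follows; the notation here is mine rather than the paper's.

% Worst-case probability that a class-1 point (mean \mu_1, covariance \Sigma_1)
% falls on the wrong side of the hyperplane w^\top x = b, taken over all
% distributions with those moments and assuming w^\top \mu_1 \ge b:
\[
  \sup_{X \sim (\mu_1, \Sigma_1)} \Pr\!\left(w^\top X \le b\right)
    = \frac{1}{1 + \kappa^2},
  \qquad
  \kappa = \frac{w^\top \mu_1 - b}{\sqrt{w^\top \Sigma_1 w}} .
\]

The MPM chooses (w, b) so that the larger of the two class bounds is as small as possible; when the moments themselves are only estimated, the high-probability deviation bounds discussed in the abstract take their place, so that the resulting guarantee holds with high probability over the sample.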
