Sampling and empirical risk minimization

Stephan Clémençon, Patrice Bertail, Emilie Chautru

doi:10.1080/02331888.2016.1259810

Abstract

In certain situations that will undoubtedly become more and more common in the Big Data era, the datasets available are so massive that computing statistics over the full samples is hardly feasible, if not infeasible. A natural approach in this context consists in using survey schemes and substituting the ‘full data’ statistics with their counterparts based on the resulting random samples, of manageable size. It is the main purpose of this paper to investigate the impact of survey sampling on statistical learning methods based on empirical risk minimization through the standard binary classification problem, considered here as a ‘case in point’. Precisely, we prove that, in the presence of auxiliary information, appropriate use of optimally coupled Poisson survey plans may not much affect the learning rates, while possibly significantly reducing the number of terms that must be averaged to compute the empirical risk functional with overwhelming probability. These striking results are next shown to extend to more general sampling schemes by means of a coupling technique, originally introduced by Hajek [Asymptotic theory of rejective sampling with varying probabilities from a finite population. Ann Math Stat. 1964;35(4):1491–1523].
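To make the idea of substituting a sampled statistic for the ‘full data’ one concrete, the following is a minimal sketch, not the authors' implementation. It assumes a Horvitz-Thompson-type weighting (each sampled loss divided by its inclusion probability), which the abstract does not name explicitly, and it uses synthetic labels, a hypothetical classifier with roughly 20% error, and arbitrary inclusion probabilities standing in for auxiliary information. All function and variable names are illustrative.

```python
import random


def poisson_sample(inclusion_probs, rng):
    """Poisson sampling: unit i is included independently with probability p_i."""
    return [i for i, p in enumerate(inclusion_probs) if rng.random() < p]


def full_risk(labels, predictions):
    """'Full data' empirical 0-1 risk: the average of all n losses."""
    n = len(labels)
    return sum(y != y_hat for y, y_hat in zip(labels, predictions)) / n


def weighted_sampled_risk(labels, predictions, inclusion_probs, sampled):
    """Sampled counterpart (assumed Horvitz-Thompson form): only the sampled
    losses are summed, each reweighted by 1 / p_i, then divided by n."""
    n = len(labels)
    return sum(
        (labels[i] != predictions[i]) / inclusion_probs[i] for i in sampled
    ) / n


rng = random.Random(0)
n = 20000

# Synthetic binary labels and a hypothetical classifier wrong about 20% of the time.
labels = [rng.randint(0, 1) for _ in range(n)]
predictions = [y if rng.random() < 0.8 else 1 - y for y in labels]

# Arbitrary unequal inclusion probabilities (stand-in for auxiliary information).
inclusion_probs = [0.05 if i % 2 == 0 else 0.2 for i in range(n)]

sampled = poisson_sample(inclusion_probs, rng)
r_full = full_risk(labels, predictions)
r_sampled = weighted_sampled_risk(labels, predictions, inclusion_probs, sampled)

print(f"terms averaged: {len(sampled)} of {n}")
print(f"full-data risk: {r_full:.4f}, sampled estimate: {r_sampled:.4f}")
```

With these settings the expected sample size is about 12.5% of the data, so the sampled estimate averages far fewer terms than the full-data risk while remaining unbiased for it under the assumed weighting.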

Full Text