Abstract

Given samples from two distributions, a non-parametric two-sample test aims to determine whether the two distributions are equal, based on a test statistic. Classically, this statistic is computed either on the whole data set or on a subset of it by a function trained on its complement. We consider methods of a third kind, designed to handle large (possibly infinite) data sets and to automatically determine the most relevant scales at which to work, and make two contributions. First, we develop a generic sequential non-parametric testing framework in which the sample size need not be fixed in advance. This makes our test a truly sequential non-parametric multivariate two-sample test. Under information-theoretic conditions qualifying the difference between the tested distributions, consistency of the two-sample test is established. Second, we instantiate our framework using nearest neighbor regressors, and show how the power of the resulting two-sample test can be improved using Bayesian mixtures and switch distributions. This combination of techniques yields automatic scale selection, and experiments on challenging data sets show that our sequential tests achieve performance comparable to that of state-of-the-art non-sequential tests.
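To make the framework concrete, the sketch below illustrates the general idea of a sequential, prediction-based two-sample test with a nearest neighbor regressor: labeled samples arrive one at a time, a k-nearest-neighbor predictor (fit only on past samples) forecasts each label, and a log-likelihood ratio against a feature-blind null model accumulates until a stopping threshold is crossed. This is only a minimal illustration, not the paper's exact procedure; the function name, the choice k=5, the Laplace smoothing, the fixed null probability of 0.5 (which assumes the two samples are interleaved in a balanced way), and the stopping threshold are all illustrative assumptions.

```python
import numpy as np

def sequential_two_sample_test(stream, k=5, threshold=np.log(20.0)):
    """Sketch of a sequential two-sample test.

    `stream` yields (x, y) pairs, where x is a feature vector and y in {0, 1}
    indicates which of the two distributions x was drawn from. The test stops
    and rejects the null as soon as the accumulated log-evidence exceeds
    `threshold` (illustrative value).
    """
    xs, ys = [], []
    log_evidence = 0.0  # log-likelihood ratio: kNN predictor vs. null model
    for t, (x, y) in enumerate(stream):
        x = np.asarray(x, dtype=float)
        if t >= 2 * k:  # wait for some history before predicting
            dists = np.linalg.norm(np.asarray(xs) - x, axis=1)
            nn = np.argsort(dists)[:k]
            # Laplace-smoothed estimate of P(y = 1 | x) from the k nearest past samples
            p1 = (np.sum(np.asarray(ys)[nn]) + 1.0) / (k + 2.0)
            p_model = p1 if y == 1 else 1.0 - p1
            p_null = 0.5  # null model: the label carries no information about x
            log_evidence += np.log(p_model) - np.log(p_null)
            if log_evidence > threshold:
                return True, t + 1  # reject: evidence that the distributions differ
        xs.append(x)
        ys.append(int(y))
    return False, len(ys)  # stream exhausted without sufficient evidence
```

In this style of test, stopping as soon as the evidence crosses the threshold is what removes the need to fix the sample size in advance; the scale-selection aspect of the paper (Bayesian mixtures and switch distributions over different values of k) is not reproduced here.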
