Abstract

This article develops statistical methods for testing the equality of two distributions based on two independent samples generated in some separable metric space. Such methods are broadly applicable in identifying similarities or differences between two complicated data sets (e.g., high-dimensional data or functional data) collected in a wide range of research or industry areas, including biology, bioinformatics, medicine, and materials science, among others. Recently, a so-called maximum mean discrepancy (MMD) based approach to the above two-sample problem has been proposed, resulting in several interesting tests. However, the main theoretical and numerical results of these MMD-based tests depend on the very restrictive assumption that the two samples have equal sample sizes. In addition, these tests are generally implemented via permutation when the equal sample size assumption is violated. In real data analysis, this equal sample size assumption is rarely satisfied, and discarding some of the observations often means losing valuable information. It is also of interest to know whether an MMD-based test can be conducted generally without using permutation. In this paper, we further study this MMD-based approach with the equal sample size assumption removed. We establish the asymptotic null and alternative distributions of the MMD test statistic and its root-n consistency. We propose methods for approximating the null distribution, resulting in easy and quick implementation. Numerical experiments based on artificial data and two real data sets from two different areas of application demonstrate that, in terms of control of the type I error level and power, the resulting tests perform better than or no worse than several existing competitors.
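
To make the quantity under discussion concrete, the following is a minimal sketch of a squared-MMD estimate computed from two samples of unequal sizes. It is an illustration only and not necessarily the exact statistic or normalization studied in the paper: the Gaussian kernel, the fixed `bandwidth`, the biased (V-statistic) form, and the function names `gaussian_kernel` and `mmd_squared` are all assumptions introduced here.

```python
import numpy as np


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between rows of a (n, d) and rows of b (m, d)."""
    # Pairwise squared Euclidean distances via the expansion |u - v|^2.
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))


def mmd_squared(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD.

    x has shape (n, d) and y has shape (m, d); n and m may differ.
    The kernel and bandwidth are illustrative assumptions.
    """
    k_xx = gaussian_kernel(x, x, bandwidth)
    k_yy = gaussian_kernel(y, y, bandwidth)
    k_xy = gaussian_kernel(x, y, bandwidth)
    # Each block is averaged over its own size, so no equal-size assumption is needed.
    return k_xx.mean() + k_yy.mean() - 2.0 * k_xy.mean()


# Synthetic data with unequal sample sizes (80 versus 120 observations).
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(80, 5))
y_same = rng.normal(0.0, 1.0, size=(120, 5))   # same distribution as x
y_shift = rng.normal(1.0, 1.0, size=(120, 5))  # mean-shifted distribution

mmd_same = mmd_squared(x, y_same)
mmd_shift = mmd_squared(x, y_shift)
print(f"same distribution:    {mmd_same:.4f}")
print(f"shifted distribution: {mmd_shift:.4f}")
```

A larger value suggests a larger discrepancy between the two samples. Turning the value into a test still requires a reference distribution, which is what the null-distribution approximations described in the abstract are meant to supply in place of permutation.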
