By the power of Grayskull

Glenn Stone, Laurence A. F. Park

doi:10.1145/2682862.2682878

Abstract

Information Retrieval evaluation is typically performed using a sample of queries, and a statistical hypothesis test is used to make inferences about the systems' accuracy on the population of queries. Research has shown that, when evaluating with a large sample of queries, the t test is one of a set of tests that provide the greatest statistical power while maintaining acceptable type I error rates. In this article, we investigate how using a small query sample, for which the hypothesis tests may not satisfy Central Limit Theorem conditions, affects the control of the type I error rate and the type II error rate of a given set of hypothesis tests. We found that all tests performed similarly in the unpaired setting. We also found that the bootstrap test provided greater power for the paired test, but violated the desired type I error rate for the smallest sample size (5 queries).
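To make the paired setting concrete, here is a minimal Python sketch of one common form of paired bootstrap test (the "shift" method), together with a small simulation that estimates its type I error rate at 5 queries. The abstract does not specify which bootstrap variant the study used, so the function names, the uniform score distribution, and the resample counts are all illustrative assumptions rather than the authors' procedure.

```python
import random


def paired_bootstrap_pvalue(scores_a, scores_b, n_resamples=2000, rng=None):
    """Two-sided bootstrap p-value for the null of zero mean paired difference.

    Shift-method variant (an assumption; the abstract names no variant).
    """
    rng = rng or random.Random()
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = sum(diffs) / n
    # Centre the differences so that resampling reflects the null hypothesis.
    centred = [d - observed for d in diffs]
    extreme = 0
    for _ in range(n_resamples):
        resample = [rng.choice(centred) for _ in range(n)]
        if abs(sum(resample) / n) >= abs(observed):
            extreme += 1
    return extreme / n_resamples


def type_i_error_rate(n_queries=5, n_trials=200, n_resamples=500,
                      alpha=0.05, seed=0):
    """Estimate how often the test rejects when the null is true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_trials):
        # Hypothetical setup: both systems draw per-query scores from the
        # same uniform distribution, so any rejection is a type I error.
        a = [rng.random() for _ in range(n_queries)]
        b = [rng.random() for _ in range(n_queries)]
        if paired_bootstrap_pvalue(a, b, n_resamples, rng) < alpha:
            rejections += 1
    return rejections / n_trials


if __name__ == "__main__":
    print("Estimated type I error rate at 5 queries:", type_i_error_rate())
```

An estimated rate noticeably above the chosen alpha at 5 queries would be in line with the violation reported in the abstract, although a toy simulation with uniform scores is not a substitute for the study's own experiments.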

Full Text