Abstract

Many statistical methods rely on an underlying mathematical model of probability based on a simple approximation, one that is both well known and frequently misunderstood. The Normal approximation to the Binomial distribution underpins a range of statistical tests and methods, including the calculation of accurate confidence intervals, goodness-of-fit and contingency tests, line- and model-fitting, and computational methods based upon these. A common mistake is to assume that, since the probable distribution of error about the “true value” in the population is approximately Normally distributed, the same can be said for the error about an observation. This paper is divided into two parts: fundamentals and evaluation. First, we examine the estimation of confidence intervals using three initial approaches: the “Wald” (Normal) interval, the Wilson score interval and the “exact” Clopper-Pearson Binomial interval. Whereas the first two can be calculated directly from formulae, the Binomial interval must be approximated by computational search, which is computationally expensive. However, this interval provides the most precise significance test, and therefore forms the baseline for our later evaluations. We also consider two further refinements: employing log-likelihood in intervals (also requiring search) and the effect of adding a continuity correction. Second, we evaluate each approach in three test paradigms. These are the single-proportion interval or 2 × 1 goodness-of-fit test, and two variations on the common 2 × 2 contingency test. We evaluate the performance of each approach by a “practitioner strategy”. Since standard advice is to fall back to “exact” Binomial tests in conditions where approximations are expected to fail, we report the proportion of instances where one test obtains a significant result when the equivalent exact test does not, and vice versa, across an exhaustive set of possible values. We demonstrate that optimal methods are based on continuity-corrected versions of the Wilson interval or Yates’ test, and that commonly-held beliefs about weaknesses of tests are misleading. Log-likelihood, often proposed as an improvement on χ², performs disappointingly. Finally we note that at this level of precision we may distinguish two types of 2 × 2 test according to whether the independent variable partitions data into independent populations, and we make practical recommendations for their use.
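The three initial interval approaches described above can be illustrated in a few lines of code. The sketch below is not the paper's own implementation: function names and the worked values are illustrative, a two-sided 95% level is assumed, and the Clopper-Pearson bounds are obtained via the standard Beta-quantile equivalence rather than the direct computational search the abstract describes.

```python
"""Illustrative sketch: Wald, Wilson and Clopper-Pearson intervals for p = k/n."""
from math import sqrt
from scipy.stats import norm, beta


def wald_interval(k, n, alpha=0.05):
    """'Wald' (Normal) interval: p +/- z * sqrt(p(1-p)/n), centred on the observation."""
    z = norm.ppf(1 - alpha / 2)
    p = k / n
    half = z * sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)


def wilson_interval(k, n, alpha=0.05):
    """Wilson score interval: inverts the Normal test about the population value."""
    z = norm.ppf(1 - alpha / 2)
    p = k / n
    centre = (p + z**2 / (2 * n)) / (1 + z**2 / n)
    half = (z / (1 + z**2 / n)) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half


def clopper_pearson_interval(k, n, alpha=0.05):
    """'Exact' Binomial interval, here via Beta quantiles rather than search."""
    lower = 0.0 if k == 0 else beta.ppf(alpha / 2, k, n - k + 1)
    upper = 1.0 if k == n else beta.ppf(1 - alpha / 2, k + 1, n - k)
    return lower, upper


if __name__ == "__main__":
    k, n = 3, 20  # e.g. 3 successes in 20 observations
    for name, fn in [("Wald", wald_interval),
                     ("Wilson", wilson_interval),
                     ("Clopper-Pearson", clopper_pearson_interval)]:
        lo, hi = fn(k, n)
        print(f"{name:16s} [{lo:.4f}, {hi:.4f}]")
```

Running this for a small sample makes the abstract's point concrete: the Wald interval is symmetric about the observation, while the Wilson and Clopper-Pearson intervals are asymmetric near the boundaries of the probability scale.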
