Estimate-based goodness-of-fit test for large sparse multinomial distributions

Sung-Ho Kim, Hyemi Choi, Sangjin Lee

doi:10.1016/j.csda.2008.10.011

Abstract

Pearson’s chi-squared statistic ($X^2$) does not in general follow a chi-square distribution when it is used for goodness-of-fit testing of a multinomial distribution based on sparse contingency table data. We explore properties of the $D^2$ statistic of [Zelterman, D., 1987. Goodness-of-fit tests for large sparse multinomial distributions. J. Amer. Statist. Assoc. 82 (398), 624–629] and compare them with those of $X^2$. We also compare the power of goodness-of-fit tests based on $D^2$, $X^2$, and the statistic $L_r$ proposed by [Maydeu-Olivares, A., Joe, H., 2005. Limited- and full-information estimation and goodness-of-fit testing in $2^n$ contingency tables: A unified framework. J. Amer. Statist. Assoc. 100 (471), 1009–1020] when the given contingency table is very sparse. We show that the variance of $D^2$ is not larger than the variance of $X^2$ under null hypotheses where all the cell probabilities are positive, that the distribution of $D^2$ becomes more skewed as the multinomial distribution becomes more asymmetric and sparse, and that, for the $L_r$ statistic, the power of the goodness-of-fit test depends on the models selected for the testing. The results of a simulation experiment strongly support using both $D^2$ and $L_r$ for goodness-of-fit testing with large sparse contingency table data.
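As a rough illustration of the variance comparison stated in the abstract, the sketch below computes both statistics for a fully specified null hypothesis and estimates their variances by simulation. It is not the authors' code. It assumes the commonly cited form of Zelterman's statistic, $D^2 = X^2 - \sum_i n_i / (N p_i)$, which the abstract itself does not spell out. The function names, the 200-cell null distribution, and the sample size of 50 are hypothetical choices made only for this example.

```python
import random
from collections import Counter


def pearson_x2(counts, probs):
    """Pearson's X^2 = sum_i (n_i - N*p_i)^2 / (N*p_i)."""
    n_total = sum(counts)
    return sum((n - n_total * p) ** 2 / (n_total * p)
               for n, p in zip(counts, probs))


def zelterman_d2(counts, probs):
    """Assumed form of D^2: X^2 - sum_i n_i / (N*p_i)."""
    n_total = sum(counts)
    correction = sum(n / (n_total * p) for n, p in zip(counts, probs))
    return pearson_x2(counts, probs) - correction


def sample_counts(n_total, probs, rng):
    """Draw one multinomial table of n_total observations."""
    draws = rng.choices(range(len(probs)), weights=probs, k=n_total)
    tally = Counter(draws)
    return [tally[i] for i in range(len(probs))]


def sample_variance(values):
    """Unbiased sample variance."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def simulate_null(n_total, probs, replications, seed=0):
    """Return sample variances of X^2 and D^2 under the null."""
    rng = random.Random(seed)
    x2_values, d2_values = [], []
    for _ in range(replications):
        counts = sample_counts(n_total, probs, rng)
        x2_values.append(pearson_x2(counts, probs))
        d2_values.append(zelterman_d2(counts, probs))
    return sample_variance(x2_values), sample_variance(d2_values)


if __name__ == "__main__":
    # Hypothetical sparse, asymmetric null: 200 cells, 50 observations.
    weights = [1.0 / (i + 1) for i in range(200)]
    total = sum(weights)
    probs = [w / total for w in weights]
    var_x2, var_d2 = simulate_null(50, probs, replications=2000)
    print("sample variance of X^2:", var_x2)
    print("sample variance of D^2:", var_d2)
```

Under the assumed form, a uniform null with $k$ cells gives $D^2 = X^2 - k$ exactly, so the two variances coincide there. Any difference between them can therefore only show up when the cell probabilities are unequal, which is why the example uses an asymmetric null.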
