Outlier Robust Finite Population Estimation

Raymond L Chambers

doi:10.2307/2289084

Abstract

Outliers in sample data are a perennial problem for applied survey statisticians. Moreover, it is a problem for which traditional sample survey theory offers no real solution, beyond the sensible advice that such sample elements should not be weighted to their fullest extent in estimation. Sample outliers can be identified as of two basic types. Here we are concerned with the first type, which may conveniently be termed representative outliers. These are sample elements with values that have been correctly recorded and that cannot be assumed to be unique. That is, there is no good reason to assume there are no more similar outliers in the nonsampled part of the target population. The remaining sample outliers, which by default are termed nonrepresentative, are sample elements whose data values are incorrect or unique in some sense. Methods for dealing with these nonrepresentative outliers lie basically within the scope of survey editing and imputation theory and are, therefore, not considered in this article. The specific problem considered here is that of robust estimation of a finite population total given sample data containing representative outliers. The approach is model based, in that it assumes the existence of a popular “kernel” superpopulation model that adequately describes the behavior of nonoutliers in the target population. An outlier robust version of the best linear unbiased estimator of the population total under this kernel model is proposed in Section 2. This robust estimator can be viewed as a finite population prediction analog of the well-known M-estimator approach to robust parametric estimation in infinite populations (see Huber 1981). Some asymptotic theory for the proposed estimator is given, based on a central limit theorem for its prediction error under a “gross error” type of outlier generation mechanism.
This theory indicates a trade-off between bias and variance robustness for this estimator in situations in which the outlier values are not symmetrically distributed around their kernel model expectations. The article also contains some results from a comparative empirical study of the proposed robust estimator (Sec. 3). This study indicates that the use of this estimator leads to substantial gains over both conventional design-unbiased and “standard” kernel model-based estimation strategies in a population with a significant number of outliers.
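
As a rough illustration of the idea summarized above, the following Python sketch robustifies a model-based predictor of a population total. It is not the estimator defined in Section 2 of the article. It assumes, purely for illustration, a ratio-type kernel model in which y has mean beta * x and standard deviation proportional to sqrt(x), a Huber psi function, and a MAD scale estimate. The function names `huber_psi` and `robust_total` and the tuning constants `c_fit` and `c_corr` are hypothetical.

```python
import numpy as np


def huber_psi(r, c):
    """Huber's psi function: identity on [-c, c], clipped outside."""
    return np.clip(r, -c, c)


def robust_total(y_s, x_s, x_nonsample_total, c_fit=1.345, c_corr=0.0, n_iter=50):
    """Sketch of an outlier-robust predictor of a finite population total.

    Assumed kernel model (an illustration, not taken from the article):
    y_i = beta * x_i + sigma * sqrt(x_i) * e_i.
    """
    y_s = np.asarray(y_s, dtype=float)
    x_s = np.asarray(x_s, dtype=float)
    scale_i = np.sqrt(x_s)  # assumed: sd of y_i proportional to sqrt(x_i)

    def scaled_residuals(beta):
        resid = (y_s - beta * x_s) / scale_i
        # MAD-based scale estimate, a common robust choice
        sigma = 1.4826 * np.median(np.abs(resid - np.median(resid)))
        return resid, sigma

    beta = y_s.sum() / x_s.sum()  # start from the ordinary ratio estimate
    for _ in range(n_iter):
        resid, sigma = scaled_residuals(beta)
        if sigma == 0.0:
            break
        r = resid / sigma
        # Huber weights psi(r) / r, set to 1 where r is exactly 0
        w = np.ones_like(r)
        nz = r != 0
        w[nz] = huber_psi(r[nz], c_fit) / r[nz]
        # Weighted least squares step for the assumed model
        beta_new = np.sum(w * y_s) / np.sum(w * x_s)
        converged = abs(beta_new - beta) < 1e-10
        beta = beta_new
        if converged:
            break

    # Predict the nonsampled part of the total with the robust slope
    total = y_s.sum() + beta * x_nonsample_total

    # Optional correction built from clipped residuals. c_corr = 0 applies
    # no correction; c_corr = inf recovers the ordinary ratio estimator.
    resid, sigma = scaled_residuals(beta)
    if sigma > 0.0:
        r = resid / sigma
        total += (x_nonsample_total / x_s.sum()) * np.sum(
            sigma * scale_i * huber_psi(r, c_corr)
        )
    return total
```

For example, `robust_total([2.1, 3.9, 6.2, 7.8, 50.0], [1, 2, 3, 4, 5], 100.0)` downweights the fifth sample value when estimating the slope, so the predicted total is smaller than the ordinary ratio estimate. In this sketch, raising `c_corr` moves the result back toward the ratio estimator, which gives a crude picture of the bias and variance trade-off mentioned above: a small constant limits the influence of outliers on the variance, while a large one reduces bias when outliers are not symmetric about the kernel model.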
