Sample Criteria for Testing Outlying Observations

Frank E. Grubbs

doi:10.1214/aoms/1177729885

Abstract

The problem of testing outlying observations, although an old one, is of considerable importance in applied statistics. Many and various types of significance tests have been proposed by statisticians interested in this field of application. In this connection, we bring out in the Historical Comments notable advances toward a clear formulation of the problem and important points which should be considered in attempting a complete solution. In Section 4 we state some of the situations the experimental statistician will very likely encounter in practice, these considerations being based on experience. For testing the significance of the largest observation in a sample of size $n$ from a normal population, we propose the statistic $\frac{S^2_n}{S^2} = \frac{\sum^{n-1}_{i=1} (x_i - \bar x_n)^2}{\sum^n_{i=1} (x_i - \bar x)^2}$, where $x_1 \leq x_2 \leq \cdots \leq x_n$, $\bar x_n = \frac{1}{n - 1} \sum^{n-1}_{i=1} x_i$ and $\bar x = \frac{1}{n}\sum^{n}_{i=1} x_i$. A similar statistic, $S^2_1/S^2$, can be used for testing whether the smallest observation is too low. It turns out that $\frac{S^2_n}{S^2} = 1 - \frac{1}{n - 1} \left(\frac{x_n - \bar x}{s}\right)^2 = 1 - \frac{1}{n - 1} T^2_n$, where $s^2 = \frac{1}{n}\sum^n_{i=1} (x_i - \bar x)^2$, and $T_n$ is the studentized extreme deviation already suggested by E. Pearson and C. Chandra Sekar [1] for testing the significance of the largest observation. Based on previous work by W. R. Thompson [12], Pearson and Chandra Sekar were able to obtain certain percentage points of $T_n$ without deriving the exact distribution of $T_n$. The exact distribution of $S^2_n/S^2$ (or $T_n$) is apparently derived for the first time by the present author.
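The relation between $S^2_n/S^2$ and $T_n$ stated above can be checked in a few lines. The following is a sketch of one possible route, not necessarily the argument used in the paper; it relies only on the definitions of $\bar x_n$, $\bar x$, $S^2_n$, $S^2$ and $s^2$ given in the abstract. Since $\bar x = \frac{(n-1)\bar x_n + x_n}{n}$, we have $\bar x_n - \bar x = -\frac{x_n - \bar x}{n-1}$, and splitting $S^2$ about $\bar x_n$ gives

$$S^2 = \sum^{n-1}_{i=1} (x_i - \bar x_n)^2 + (n-1)(\bar x_n - \bar x)^2 + (x_n - \bar x)^2 = S^2_n + \frac{n}{n-1}(x_n - \bar x)^2.$$

Dividing by $S^2 = n s^2$ then yields

$$\frac{S^2_n}{S^2} = 1 - \frac{n}{n-1}\,\frac{(x_n - \bar x)^2}{n s^2} = 1 - \frac{1}{n-1}\left(\frac{x_n - \bar x}{s}\right)^2 = 1 - \frac{1}{n-1} T^2_n.$$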
For testing whether the two largest observations are too large we propose the statistic $\frac{S^2_{n-1,n}}{S^2} = \frac{\sum^{n-2}_{i=1} (x_i - \bar x_{n-1,n})^2}{\sum^n_{i=1} (x_i - \bar x)^2}$, where $\bar x_{n-1,n} = \frac{1}{n - 2} \sum^{n-2}_{i=1} x_i$, and a similar statistic, $S^2_{1,2}/S^2$, can be used to test the significance of the two smallest observations. The probability distributions of the above sample statistics, with $S^2 = \sum^n_{i=1} (x_i - \bar x)^2$, $S^2_n = \sum^{n-1}_{i=1} (x_i - \bar x_n)^2$ and $S^2_1 = \sum^n_{i=2} (x_i - \bar x_1)^2$, where $\bar x_1 = \frac{1}{n-1} \sum^n_{i=2} x_i$, are derived for a normal parent, and tables of appropriate percentage points are given in this paper (Table I and Table V). Although the efficiencies of the above tests have not been completely investigated under various models for outlying observations, it is apparent that the proposed sample criteria have considerable intuitive appeal.
In deriving the distributions of the sample statistics for testing the largest (or smallest) or the two largest (or two smallest) observations, it was first necessary to derive the distribution of the difference between the extreme observation and the sample mean in terms of the population $\sigma$. This probability distribution was apparently derived first by A. T. McKay [11], who employed the method of characteristic functions. The author was not aware of the work of McKay when the simplified derivation for the distribution of $\frac{x_n - \bar x}{\sigma}$ outlined in Section 5 below was worked out by him in the spring of 1945, McKay's result being called to his attention by C. C. Craig. It has been noted also that K. R. Nair [20] worked out independently and published the same derivation of the distribution of the extreme minus the mean arrived at by the present author (see Biometrika, Vol. 35, May 1948). We nevertheless include part of this derivation in Section 5 below as it was basic to the work in connection with the derivations given in Sections 8 and 9. Our table is considerably more extensive than Nair's table of the probability integral of the extreme deviation from the sample mean in normal samples, since Nair's table runs from $n = 2$ to $n = 9$, whereas our Table II is for $n = 2$ to $n = 25$. The present work is concluded with some examples.
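To make the definitions above concrete, here is a minimal Python sketch that computes the four ratios $S^2_n/S^2$, $S^2_1/S^2$, $S^2_{n-1,n}/S^2$ and $S^2_{1,2}/S^2$ together with $T_n$. The function names `sum_sq_dev` and `outlier_criteria`, the dictionary keys, the minimum sample size of 4 and the sample data are all illustrative assumptions and do not come from the paper. The sketch only evaluates the statistics; judging significance still requires the percentage points tabulated in the paper, which are not reproduced here.

```python
import math


def sum_sq_dev(values):
    """Sum of squared deviations of `values` about their own mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def outlier_criteria(sample):
    """Compute the sample criteria from the abstract (hypothetical helper).

    Returns a dict with the ratios S_n^2/S^2, S_1^2/S^2, S_{n-1,n}^2/S^2,
    S_{1,2}^2/S^2 and the studentized extreme deviation T_n.
    """
    xs = sorted(sample)          # x_1 <= x_2 <= ... <= x_n
    n = len(xs)
    if n < 4:                    # assumption of this sketch, not of the paper
        raise ValueError("need at least 4 observations")
    total = sum_sq_dev(xs)       # S^2
    if total == 0.0:
        raise ValueError("all observations are equal")
    mean = sum(xs) / n
    s = math.sqrt(total / n)     # s^2 uses divisor n, as in the abstract
    return {
        "largest": sum_sq_dev(xs[:-1]) / total,       # S_n^2 / S^2
        "smallest": sum_sq_dev(xs[1:]) / total,       # S_1^2 / S^2
        "two_largest": sum_sq_dev(xs[:-2]) / total,   # S_{n-1,n}^2 / S^2
        "two_smallest": sum_sq_dev(xs[2:]) / total,   # S_{1,2}^2 / S^2
        "T_n": (xs[-1] - mean) / s,
    }


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0, 10.0]   # made-up sample for illustration
    crit = outlier_criteria(data)
    n = len(data)
    # Numerical check of S_n^2/S^2 = 1 - T_n^2/(n - 1)
    identity_rhs = 1.0 - crit["T_n"] ** 2 / (n - 1)
    print(crit)
    print(abs(crit["largest"] - identity_rhs) < 1e-12)
```

Each ratio lies between 0 and 1, and a value near 0 means that removing the suspected observation(s) eliminates most of the total sum of squares, which is what makes the corresponding observation(s) suspect.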

Journal: The Annals of Mathematical Statistics	Publication Date: Mar 1, 1950
Citations: 1348	License type: implied-oa
