Abstract

The techniques of exploratory data analysis include a resistant rule for identifying possible outliers in univariate data. Using the lower and upper fourths, F_L and F_U (approximate quartiles), it labels as “outside” any observations below F_L − 1.5(F_U − F_L) or above F_U + 1.5(F_U − F_L). For example, in the ordered sample −5, −2, 0, 1, 8, F_L = −2 and F_U = 1, so any observation below −6.5 or above 5.5 is outside. Thus the rule labels 8 as outside. Some related rules also use cutoffs of the form F_L − k(F_U − F_L) and F_U + k(F_U − F_L). This approach avoids the need to specify the number of possible outliers in advance; as long as they are not too numerous, any outliers do not affect the location of the cutoffs. To describe the performance of these rules, we define the some-outside rate per sample as the probability that a sample will contain one or more outside observations. Its complement is the all-inside rate per sample. We also define the outside rate per observation as the average fraction of outside observations. For Gaussian data the population all-inside rate per sample (0) and the population outside rate per observation (0.7%) substantially understate the corresponding small-sample values. Simulation studies using Gaussian samples with n between 5 and 300 yield detailed information on the resistant rules. The main resistant rule (k = 1.5) has an all-inside rate per sample between 67% and 86% for 5 ≤ n ≤ 20, and corresponding estimates of its outside rate per observation range from 8.6% to 1.7%. Both characteristics vary with n in ways that lead to good empirical approximations. Because of the way in which the fourths are defined, the sample sizes separate into four classes, according to whether dividing n by 4 leaves a remainder of 0, 1, 2, or 3. Within these four classes the all-inside rate per sample shows a roughly linear decrease with n over the range 9 ≤ n ≤ 50, and the outside rate per observation decreases linearly in 1/n for n ≥ 9.
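The labeling rule described above can be sketched in a few lines of Python. This is an illustrative sketch, not code from the paper: the function names `fourths` and `outside` are hypothetical. It assumes the usual convention that when the depth of the fourths ends in .5, each fourth is the average of the two adjacent order statistics.

```python
import math


def fourths(sorted_xs):
    """Return the lower and upper fourths of an already sorted, non-empty list.

    The depth of the fourths is f = 0.5 * floor((n + 3) / 2), counted from
    each end of the ordered sample (positions are 1-based).
    """
    n = len(sorted_xs)
    if n == 0:
        raise ValueError("need at least one observation")
    f = math.floor((n + 3) / 2) / 2      # depth; may end in .5
    lo, hi = math.floor(f), math.ceil(f)  # adjacent 1-based positions
    # Assumption: a half-integer depth averages the two adjacent order statistics.
    lower = (sorted_xs[lo - 1] + sorted_xs[hi - 1]) / 2
    upper = (sorted_xs[n - lo] + sorted_xs[n - hi]) / 2
    return lower, upper


def outside(xs, k=1.5):
    """Return (list of outside observations, (low cutoff, high cutoff))."""
    lower, upper = fourths(sorted(xs))
    spread = upper - lower               # the fourth-spread
    low_cut = lower - k * spread
    high_cut = upper + k * spread
    flagged = [x for x in xs if x < low_cut or x > high_cut]
    return flagged, (low_cut, high_cut)


# Worked example from the text: fourths are -2 and 1, cutoffs are -6.5 and 5.5.
print(outside([-5, -2, 0, 1, 8]))  # → ([8], (-6.5, 5.5))
```

Passing a different `k` gives the related rules with cutoffs k fourth-spreads beyond the fourths.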
A more theoretical approximation for the all-inside rate per sample works with the order statistics X(1) ≤ … ≤ X(n). In this notation the fourths are X(f) and X(n + 1 − f) with f = ½[(n + 3)/2], where [·] is the greatest-integer function. A sample has no observations outside whenever {X(f) − X(1)}/{X(n + 1 − f) − X(f)} ≤ k and {X(n) − X(n + 1 − f)}/{X(n + 1 − f) − X(f)} ≤ k. We first approximate the numerators and denominator in these ratios by constant multiples of chi-squared random variables with the same mean and variance. We then approximate the logarithm of each ratio by a Gaussian random variable, and we calculate the correlation between these variables from the fact that the ratios have the same denominator. Finally, a bivariate Gaussian probability calculation yields the approximate all-inside rate per sample. The error of the result relative to the simulation estimate is typically from 1% to 2% for 5 ≤ n ≤ 50. To provide an indication of how the two rates behave in alternative “null” situations, the simulation studies included samples from five heavier-tailed members of the family of h distributions. For a given sample size, the all-inside rate per sample decreases as the tails become heavier, and the outside rate per observation increases.
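The order-statistic condition for a sample with no outside observations lends itself to a small Monte Carlo check. The sketch below is not the authors' simulation design: the function names, the replication count, and the seed are illustrative assumptions. It also assumes that a half-integer depth f averages the two adjacent order statistics. The condition is written in cross-multiplied form so that a zero fourth-spread does not cause a division by zero.

```python
import math
import random


def all_inside(sample, k=1.5):
    """True when no observation lies beyond k fourth-spreads from the fourths."""
    x = sorted(sample)
    n = len(x)
    f = math.floor((n + 3) / 2) / 2       # depth of the fourths
    lo, hi = math.floor(f), math.ceil(f)
    lower = (x[lo - 1] + x[hi - 1]) / 2   # X(f)
    upper = (x[n - lo] + x[n - hi]) / 2   # X(n + 1 - f)
    spread = upper - lower
    # Equivalent to both ratios in the text being at most k.
    return (lower - x[0] <= k * spread) and (x[-1] - upper <= k * spread)


def estimate_all_inside_rate(n, reps=4000, k=1.5, seed=0):
    """Estimate the all-inside rate per sample for standard Gaussian samples."""
    rng = random.Random(seed)             # fixed seed for reproducibility
    hits = sum(
        all_inside([rng.gauss(0.0, 1.0) for _ in range(n)], k)
        for _ in range(reps)
    )
    return hits / reps


if __name__ == "__main__":
    for n in (5, 10, 20):
        print(n, estimate_all_inside_rate(n))
```

With enough replications, the estimates for 5 ≤ n ≤ 20 and k = 1.5 should fall near the 67% to 86% range quoted above. Replacing `rng.gauss` with a heavier-tailed generator would give a rough look at the alternative "null" situations.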