Many fundamental multivariate methods use the F distribution and its associated tests and critical values. It is the basis of many common statistical tests in chemometrics, for example, for detecting outliers or determining whether an observation belongs to a predefined class. We have seen that the t distribution is appropriate for estimating critical values or confidence limits when a population has an underlying normal distribution but the sample size is small. This is primarily a consequence of the difficulty of determining the population standard deviation: because the sample estimate more often than not underestimates it, the apparent distribution of distances from the mean is distorted. When we discussed the chi-squared distribution [1], we noted that it represents the distribution of squared Mahalanobis distances from the mean and, in particular, that if more than one variable is measured, there is no specific positive or negative direction, so using squared distances (which are independent of direction) is essential. Hence, the chi-squared distribution extends naturally from univariate to multivariate data. The F distribution can be regarded as the equivalent extension of the t distribution when there is more than one variable but the sample size is small. The distribution is widely employed in many diverse areas, and there are numerous ways of introducing it in the literature. Although the F distribution is often introduced in the context of analysis of variance, in this and the next article we focus primarily on the distribution of data in multidimensional space. We will come across this distribution and its associated statistic in other contexts in later articles. The F distribution is named after R. A. Fisher, a pioneering statistician who was most active in the 1920s and 1930s and who worked in agricultural science in the UK.
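The relationships sketched above can be written compactly. The block below uses standard textbook definitions rather than the article's own equations, and the symbols U_1, U_2 and T are illustrative names that do not appear in the original text; ν1 and ν2 denote the two degrees of freedom of the F distribution.

```latex
% Standard definitions (assumed notation, not taken from the article).
% F as a ratio of two independent chi-squared variables, each scaled by
% its degrees of freedom:
F = \frac{U_1/\nu_1}{U_2/\nu_2},
\qquad U_1 \sim \chi^2_{\nu_1},\quad U_2 \sim \chi^2_{\nu_2},
\quad U_1 \text{ and } U_2 \text{ independent}
\;\Longrightarrow\; F \sim F_{\nu_1,\nu_2}

% Link to the t distribution: squaring removes the sign (direction),
% and the square of a t variable follows an F distribution with
% \nu_1 = 1:
T \sim t_{\nu} \;\Longrightarrow\; T^2 \sim F_{1,\nu}
```

Under these definitions, a two-tailed t test at a given probability level corresponds to a one-tailed F test with ν1 = 1 at the same level.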
Many of the fundamental multivariate methods, such as several approaches for one-class classification or class modelling, including SIMCA and multivariate statistical process control, use the F distribution and its associated tests and critical values. However, in order to understand it, it is also necessary to understand its relationship to other distributions. If there are a large number of observations (i.e. ν2 is large), then the shape of the F distribution is very similar to that of the chi-squared distribution with ν1 degrees of freedom, as illustrated in Figure 2, although the horizontal scale differs (in the limit of large ν2, ν1F follows the chi-squared distribution with ν1 degrees of freedom, so the two coincide when ν1 = 1). Note that if both ν1 and ν2 are large, the F distribution also resembles the normal distribution, with a mean of 1.
Figure 3 illustrates several different F distributions. We can note several things. If ν1 and ν2 are large, then we can see from the equations earlier that the mean is approximately equal to 1 and the variance to 2/ν1 + 2/ν2 (that is, 4/ν2 when ν1 = ν2), so the larger ν2, the sharper and more symmetric the Gaussian. This is illustrated in Figure 4 for the case where ν1 = 1000. Such situations, whilst rare in traditional statistical applications, may often be encountered in chemometrics, where there may be a large number of variables, although they would still require large sample sizes. Note that the mode changes position as ν2 increases and the distribution becomes more symmetric.
In traditional statistical texts, it is usual to present F distribution tables. Because there are so many possible F distributions, these are usually presented as critical values. The critical value at p = 0.01 is the value of the F statistic that is expected to be exceeded by only 1% of the data; in some cases, this is called the 99% confidence limit.
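The large-sample behaviour described above can be checked against the standard moments of the F distribution. The equations referred to in the text are not reproduced in this section, so the forms below are the usual textbook expressions and may differ in notation from the article's own; the symbol F_{p} for the critical value is an assumption made for illustration.

```latex
% Mean and variance of F with (\nu_1, \nu_2) degrees of freedom
% (standard results; assumed notation).
\mathrm{E}[F] = \frac{\nu_2}{\nu_2 - 2}, \qquad \nu_2 > 2

\mathrm{Var}[F] = \frac{2\,\nu_2^{2}\,(\nu_1 + \nu_2 - 2)}
                       {\nu_1\,(\nu_2 - 2)^{2}\,(\nu_2 - 4)}, \qquad \nu_2 > 4

% When both degrees of freedom are large:
\mathrm{E}[F] \approx 1, \qquad
\mathrm{Var}[F] \approx \frac{2}{\nu_1} + \frac{2}{\nu_2}
\quad \left(= \frac{4}{\nu_2} \text{ if } \nu_1 = \nu_2\right)

% Chi-squared limit for a large number of observations:
\nu_1 F \;\longrightarrow\; \chi^2_{\nu_1}
\quad \text{in distribution as } \nu_2 \to \infty

% One-tailed critical value F_p at probability level p:
P\left(F > F_{p}\right) = p
```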
These tables are self-explanatory and are given in Tables 1 and 2 for two critical values, although F tables can be presented for other critical values as well. Several more comprehensive tables are available on the web [3, 4], although it is recommended that p values are calculated in Excel or another common environment. Note that the tables presented later in this article are for the one-tailed F cdf. In some contexts, it is appropriate to use two-tailed F tests, but we will not be concerned with these at this stage.
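As an illustration of calculating one-tailed p values and critical values in software rather than reading them from tables, the following is a minimal sketch in Python using only the standard library. It is not the article's method: it assumes the standard identity between the upper tail of the F distribution and the regularized incomplete beta function, evaluated with a continued fraction, and the function names (`reg_inc_beta`, `f_upper_tail`, `f_critical`) are hypothetical names chosen for this example.

```python
import math


def _betacf(a, b, x, max_iter=1000, eps=1e-14):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    if abs(d) < tiny:
        d = tiny
    d = 1.0 / d
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even-numbered term of the continued fraction.
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))
        d = 1.0 + aa * d
        if abs(d) < tiny:
            d = tiny
        c = 1.0 + aa / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        h *= d * c
        # Odd-numbered term of the continued fraction.
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        d = 1.0 + aa * d
        if abs(d) < tiny:
            d = tiny
        c = 1.0 + aa / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def reg_inc_beta(a, b, x):
    """Regularized incomplete beta function I_x(a, b) for 0 <= x <= 1."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                 + a * math.log(x) + b * math.log1p(-x))
    front = math.exp(log_front)
    # Use the symmetry relation so the continued fraction converges quickly.
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b


def f_upper_tail(x, nu1, nu2):
    """One-tailed p value: probability that an F(nu1, nu2) variable exceeds x."""
    if x <= 0.0:
        return 1.0
    return reg_inc_beta(nu2 / 2.0, nu1 / 2.0, nu2 / (nu2 + nu1 * x))


def f_critical(p, nu1, nu2):
    """Value of F exceeded with probability p, located by bisection."""
    lo, hi = 0.0, 1.0
    while f_upper_tail(hi, nu1, nu2) > p:  # widen until the root is bracketed
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if f_upper_tail(mid, nu1, nu2) > p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # Small table of one-tailed critical values for two probability levels.
    for p in (0.05, 0.01):
        for nu1, nu2 in ((1, 10), (2, 10), (5, 20)):
            crit = f_critical(p, nu1, nu2)
            print(f"p = {p:<5} nu1 = {nu1:<3} nu2 = {nu2:<3} F = {crit:.4f}")
```

Bisection is used here for transparency rather than speed. In practice, a statistical library or spreadsheet function would normally be preferred, and its results can be used to check a sketch such as this one.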