A Little‐known Robust Estimator of the Correlation Coefficient and Its Use in a Robust Graphical Test for Bivariate Normality with Applications in the Aluminium Industry

Oystein Evandt, M F Ramalhoto, Carolyn Van Lottum, Shirley Coleman

doi:10.1002/qre.658

Abstract

Industrial and business data often contain outliers. Outliers can arise from unclear procedures for production tasks or measurement, operators who do not follow procedures, failures in production or measurement equipment, the wrong type of raw material, defects in raw material, registration errors, or a response that is influenced by many factors beyond the available explanatory variables. Often there is no identifiable cause for the outliers, and they are considered to be an intrinsic part of the dataset. Since data are often considered pairwise, and more methods for analysing pairwise data are available if the data‐generating process can be modelled by a bivariate normal distribution, there is a need for a straightforward test of bivariate normality that is robust against outliers. This paper looks at a graphical test, based on probability plotting, for assessing whether it is reasonable to assume that a bivariate dataset stems from an approximately bivariate normal distribution, where the possibility of outliers is taken into account. The robust graphical (Robug) test uses a little‐known estimator of the correlation coefficient, which is demonstrated to be robust against outliers. The graphical test is illustrated using data from our practical work. First, the little‐known robust estimator of the correlation parameter in the bivariate normal distribution is compared with the traditional estimator, the product moment correlation coefficient (often called Pearson's r), and with Spearman's rank correlation coefficient and Kendall's tau. The little‐known estimator is a transformation of Kendall's tau. The comparison is based partly on theory and partly on the simulation of observations from the bivariate normal distribution.
Our conclusions are that when outliers are not an issue, Pearson's r, Spearman's coefficient and the transformation of Kendall's tau do not perform very differently in terms of bias, standard deviation and root mean square error, while Kendall's tau itself is too biased to be used as an estimator of the correlation parameter. Concerning robustness to outliers, Pearson's r is inferior to the other estimators. It seems likely that the transformation of Kendall's tau, which is far less well‐known than Pearson's r and Spearman's rank correlation coefficient, is at least as good as Spearman's coefficient when the possibility of outliers must be taken into consideration. Business and industrial improvement often requires the use of information that can be extracted from multivariate data. When the multivariate normal (MVN) distribution can be used to model the data‐generating process, more methods are generally available for analysing the data and providing predictions. Many datasets are naturally approximately MVN, so that deviations from normality imply special causes. Thus, tests for MVN facilitate the detection of outliers. Considerable insight is gained by looking at the data singly or pairwise. Pairwise datasets that come from a process that can be modelled as MVN can be modelled by a bivariate normal distribution. The robust graphical test in this paper is therefore also useful for assessing whether a multivariate dataset comes from an approximately MVN distribution. Copyright © 2004 John Wiley & Sons, Ltd.
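
The comparison of estimators described in the abstract can be illustrated with a small simulation. The abstract does not state which transformation of Kendall's tau is used; the sketch below assumes the standard relation for the bivariate normal distribution, rho = sin(pi * tau / 2). The function names, sample size, true correlation, seed and the three added outliers are all illustrative assumptions, not values from the paper.

```python
import math
import random

def pearson_r(x, y):
    """Product moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(v):
    """Ranks 1..n; assumes no ties (reasonable for continuous data)."""
    order = sorted(range(len(v)), key=v.__getitem__)
    r = [0] * len(v)
    for rank, idx in enumerate(order, start=1):
        r[idx] = rank
    return r

def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson's r computed on the ranks."""
    return pearson_r(ranks(x), ranks(y))

def kendall_tau(x, y):
    """Kendall's tau (no tie correction): (concordant - discordant) / pairs."""
    n = len(x)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)

def transformed_tau(x, y):
    """Assumed transformation: sin(pi * tau / 2), valid under bivariate normality."""
    return math.sin(math.pi * kendall_tau(x, y) / 2)

def simulate_bvn(n, rho, rng):
    """Draw n pairs from a standard bivariate normal with correlation rho."""
    x, y = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x.append(z1)
        y.append(rho * z1 + math.sqrt(1 - rho ** 2) * z2)
    return x, y

rng = random.Random(0)             # illustrative seed
x, y = simulate_bvn(200, 0.7, rng)  # illustrative n and rho

# Contaminate the sample with three hypothetical outliers that oppose the trend.
x_out = x + [8.0, 9.0, -8.0]
y_out = y + [-8.0, -9.0, 8.0]

r_clean, r_out = pearson_r(x, y), pearson_r(x_out, y_out)
t_clean, t_out = transformed_tau(x, y), transformed_tau(x_out, y_out)

for name, f in [("Pearson r", pearson_r), ("Spearman", spearman_rho),
                ("Kendall tau", kendall_tau), ("sin(pi*tau/2)", transformed_tau)]:
    print(f"{name:14s} clean={f(x, y):+.3f}  with outliers={f(x_out, y_out):+.3f}")
```

In a run of this kind, Kendall's tau itself is expected to sit well below the true correlation, which is why it is transformed before being used as an estimator, while the three outliers move Pearson's r far more than they move the rank-based estimators.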

Full Text