Abstract
Outlier detection is an interesting and important aspect of data analysis, as outliers can affect inference. The literature is abundant with procedures for detecting and testing single outliers in sample data. However, the presence of two or more outliers in multivariate data makes detection and testing more complicated, as the majority of the outliers are invisible to many of the methods. This is due to the masking effect, and classical and related methods have been found unsuitable for outlier identification in this setting. The difficulty of detection increases with the number of outliers and the dimension of the data, because the outliers can be extreme in a growing number of directions. This study provides an overview of multivariate outlier detection methods because of their growing importance in a wide variety of practical situations. DOI: http://dx.doi.org/10.4038/sljastats.v14i2.6214
Highlights
Statisticians have long been interested in finding “outlying”, “unusual”, or “unrepresentative” observations as a precursor to data analysis
Numerous robust distance-based outlier detection methods have evolved over the last two decades, and the findings are presented in order
The primary theorem proved by Rousseeuw and Van Driessen states that if one starts with a half-sample of data, orders the entire data set based on Mahalanobis distances derived from the half-sample’s mean vector and covariance matrix, and selects a new half-sample from the observations with smallest distances, the covariance determinant of the new half-sample will be less than or equal to the old half-sample covariance determinant
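The concentration step described in the last highlight can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' or Rousseeuw and Van Driessen's implementation: the simulated data, the planted outliers, the choice h = n // 2, and the helper names `c_step` and `cov_det` are all assumptions made for the example. It repeats the step from a random half-sample and records the covariance determinant, which should never increase.

```python
import numpy as np

def cov_det(X, subset):
    """Determinant of the sample covariance matrix of the rows in `subset`."""
    return np.linalg.det(np.cov(X[subset], rowvar=False))

def c_step(X, subset):
    """One concentration step: keep the h points closest to the subset's fit."""
    h = len(subset)
    mu = X[subset].mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X[subset], rowvar=False))
    diff = X - mu
    # Squared Mahalanobis distance of every observation to (mu, S)
    d2 = np.einsum("ij,jk,ik->i", diff, S_inv, diff)
    return np.argsort(d2)[:h]

# Hypothetical data: 100 trivariate normal points, 10 of them shifted
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:10] += 8.0

h = len(X) // 2  # assumed half-sample size for this sketch
subset = rng.choice(len(X), size=h, replace=False)
dets = [cov_det(X, subset)]

for _ in range(20):
    new_subset = c_step(X, subset)
    dets.append(cov_det(X, new_subset))
    converged = set(new_subset) == set(subset)
    subset = new_subset
    if converged:
        break

print(dets)
```

Because each step can only lower or preserve the determinant, iterating until the half-sample stops changing gives a local minimum; practical algorithms restart from many initial subsets and keep the best result.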
Summary
Statisticians have long been interested in finding “outlying”, “unusual”, or “unrepresentative” observations as a precursor to data analysis. The average gestation period for a human female is 280 days, so the question arose whether 349 days is merely a large observation or whether that data point belongs to another population, namely that of women who conceived much later than August 28, 1944 [1]. Another example where outliers themselves are of primary importance involves air safety, as discussed in [4]. If there were other types of mosquitoes in his data collection, he would not be interested in their characteristics; he would want to remove the observations or ensure that they do not influence the statistical estimates for the original population. In such a situation, the techniques should accommodate the outliers in the estimation without needing to detect and reject them; such techniques are called robust. If the variability is due to inherent variation, the point should remain.