Abstract

In modern statistical data analysis, a key task is the identification of outliers. For any outlier identification procedure, one needs to know its robustness against masking (an “outlier” goes undetected) and swamping (a “nonoutlier” is classified as an “outlier”). Masking robustness and swamping robustness are interrelated and must be studied together. For this purpose, Serfling and Wang (2014) provide a general framework applicable in any data space. Implementing it for particular outlier identifiers in particular types of data space, however, requires additional theoretical development specialized to the chosen setting. Even the case of univariate data presents nontrivial challenges. Here we apply the framework to study the masking and swamping robustness properties of two leading types of nonparametric outlier identifiers: scaled deviation outlyingness and centered rank outlyingness. The results shed new light on the choice between (median, MAD) and (trimmed mean, trimmed standard deviation) in scaled deviation outlyingness. Our findings also explain how the boxplot, a leading descriptive tool, performs as a hybrid outlyingness function that combines a quantile-based component describing the middle half of a data set with a scaled deviation outlyingness component for outlier detection. For both goals, the boxplot greatly favors swamping robustness over masking robustness. We also formulate a variant boxplot offering a more favorable trade-off between these two criteria.
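As a rough illustration of the univariate identifiers named above, the following Python sketch computes a scaled deviation outlyingness based on (median, MAD), a centered rank outlyingness based on the empirical distribution function, and the conventional boxplot fences. It is not taken from the paper: the function names, the unscaled MAD, the empirical-CDF form of the rank function, and the fence multiplier k = 1.5 are all assumptions chosen for illustration, and the variant boxplot proposed in the paper is not reproduced here.

```python
from statistics import median, quantiles


def mad(data):
    """Median absolute deviation from the median (unscaled; an assumption)."""
    m = median(data)
    return median([abs(v - m) for v in data])


def scaled_deviation_outlyingness(x, data):
    """|x - median| / MAD, one common form of scaled deviation outlyingness."""
    spread = mad(data)
    if spread == 0:
        raise ValueError("MAD is zero; outlyingness is undefined for this sample")
    return abs(x - median(data)) / spread


def centered_rank_outlyingness(x, data):
    """|2 F_n(x) - 1|, with F_n the empirical distribution function."""
    f_n = sum(1 for v in data if v <= x) / len(data)
    return abs(2 * f_n - 1)


def boxplot_fences(data, k=1.5):
    """Conventional fences: (Q1 - k * IQR, Q3 + k * IQR)."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


sample = [1, 2, 3, 4, 5, 6, 7, 8, 100]
lower, upper = boxplot_fences(sample)
flagged = [v for v in sample if v < lower or v > upper]
```

In this toy sample the single extreme value 100 lies outside the fences and also receives a large scaled deviation outlyingness, while the sample median receives outlyingness zero. Any threshold applied to these outlyingness values would be a further choice that this sketch does not fix.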

