Detecting Outliers and Influential and Sensitive Observations in Linear Regression

Daniel Peña

doi:10.1007/978-1-4471-7503-2_31

Abstract

This chapter reviews diagnostic and robust procedures for detecting outliers and other interesting observations in linear regression. First, we present statistics for detecting single outliers and influential observations and show their limitations when there are multiple outliers in high-leverage situations. Second, we discuss diagnostic procedures designed to avoid masking: they first find a clean subset for estimating the parameters and then increase its size by incorporating new homogeneous observations, one by one, until a heterogeneous observation is found. We also discuss procedures based on sensitive observations, which detect high-leverage outliers in large data sets using the eigenvectors of a sensitivity matrix. We briefly review robust estimation methods and their relationship with diagnostic procedures. Next, we consider large high-dimensional data sets, where iterative procedures can be slow, and show that the joint use of simple univariate statistics, such as predictive residuals, Cook’s distances, and Peña’s sensitivity statistic, can be a useful diagnostic tool. We also comment on other recent procedures based on regularization and sparse estimation and conclude with a brief analysis of the relationship between outlier detection and cluster analysis. A real-data example and a simulated example illustrate the procedures presented in the chapter.
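As a rough illustration of the univariate statistics named above, the following sketch computes leverages, predictive (leave-one-out) residuals, Cook’s distances, and Peña’s sensitivity statistic for an ordinary least squares fit. It is not taken from the chapter: the function name `single_case_diagnostics` and the returned dictionary keys are hypothetical, and the formulas are the usual textbook definitions, with the sensitivity statistic written as S_i = Σ_j h_ij² e_j² / ((1 − h_jj)² p s² h_ii), which is assumed to match the definition used in the chapter.

```python
import numpy as np

def single_case_diagnostics(X, y):
    """Single-case diagnostics for OLS (illustrative sketch, hypothetical helper).

    X is an (n, p) design matrix that already contains the intercept column;
    y is the response vector of length n.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape

    # Hat matrix H = X (X'X)^{-1} X'; its diagonal holds the leverages h_ii.
    H = X @ np.linalg.solve(X.T @ X, X.T)
    h = np.diag(H)

    # Ordinary residuals and the usual unbiased estimate of the error variance.
    e = y - H @ y
    s2 = e @ e / (n - p)

    # Predictive residuals: y_i minus the prediction from a fit without case i.
    pred = e / (1.0 - h)

    # Cook's distance: D_i = e_i^2 h_ii / (p s^2 (1 - h_ii)^2).
    cook = e**2 * h / (p * s2 * (1.0 - h) ** 2)

    # Sensitivity statistic: how much the forecast of case i changes when
    # each case j is deleted in turn, scaled by p times the estimated
    # variance of the fitted value, s^2 h_ii.
    sens = (H**2 @ pred**2) / (p * s2 * h)

    return {
        "leverage": h,
        "predictive_residual": pred,
        "cook": cook,
        "sensitivity": sens,
    }
```

Looking at the three statistics together, rather than any single one, is the kind of joint use the abstract refers to; the cutoffs for flagging observations are left to the chapter.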

Full Text