Abstract

Dimension reduction procedures have received increasing attention over the past decades. Despite this, the effect of data contamination or outlying data points on dimension reduction is not well understood, a problem compounded by the difficulty of classifying outliers in the presence of many variables. This paper formally investigates the influence of data contamination on sliced inverse regression (SIR), a prototypical dimension reduction procedure that targets a lower-dimensional subspace of a set of regressors needed to explain a response variable. We establish a general theory for how estimated reduction subspaces can be distorted through both the number and direction of outlying data points. The results depend critically on the regressor covariance structure, and the most harmful types of data contamination are shown to differ according to whether this covariance structure is known or unknown. For example, if the covariance structure is estimated, data contamination is proven to produce an estimated subspace that is automatically orthogonal to the directions of outlying data points, constituting a potentially serious loss of information. Our main results demonstrate the degree to which data contamination causes incorrect dimension reduction, depending on the amount, magnitude, and direction of contamination. Further, by metricizing distances between dimension reduction subspaces, worst-case results for data contamination can be formulated to define a finite-sample breakdown point for SIR as a measure of global robustness. Our theoretical findings are illustrated through simulation.

