Abstract

Conventional principal component analysis is highly susceptible to outliers. In particular, a single sufficiently outlying data point can draw the leading principal component toward itself. In this paper, we study the effects of outliers for high dimension, low sample size data, using asymptotics. The non-robust nature of conventional principal component analysis is verified through inconsistency under multivariate Gaussian assumptions with a single spike in the covariance structure, in the presence of a contaminating outlier. In the same setting, the robust method of spherical principal components is consistent with the population eigenvector for the spike model, even in the presence of contamination.

Highlights

  • Principal components analysis (PCA) is widely used for high dimensional data (Jolliffe [1]), including high dimension, low sample size (HDLSS) data

  • Both the sample mean and covariance are sensitive to outlying observations, and so classical PCA tends to be unreliable in the presence of outliers

  • Centering using the L1 M-estimate is recommended (Locantore et al. [6]), because it is intuitively consistent with spherical PCA


Summary

Introduction

Principal components analysis (PCA) is widely used for high dimensional data (Jolliffe [1]), including high dimension, low sample size (HDLSS) data. Devlin and Gnanadesikan [2] performed an eigenanalysis of a robust estimate of the covariance matrix to develop a robust version of PCA. The asymptotic behavior of classical PCA for HDLSS data has been established by Jung and Marron [8] under various versions of the spike eigenvalue model, with one or only a few large eigenvalues (Johnstone and Silverman [9]). They explored conditions, stated in terms of the spike parameter α, under which conventional PCA is consistent. Robustness with respect to outliers and spherical PCA (SPCA) are studied rigorously here for the first time in the HDLSS asymptotic context.
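The contrast between classical and spherical PCA can be illustrated with a small simulation. The following is a minimal sketch, not the paper's own procedure: it assumes a single-spike Gaussian model with leading eigenvalue d^α, centers with the spatial median (the L1 M-estimate) computed by Weiszfeld iterations, projects the centered data onto the unit sphere, and takes the leading singular vector. The function names, the values d = 1000, n = 20, α = 1.5, and the outlier magnitude are all illustrative assumptions.

```python
import numpy as np


def spatial_median(X, n_iter=500, tol=1e-8):
    """L1 M-estimate of location via Weiszfeld iterations (assumed solver)."""
    m = np.median(X, axis=0)  # coordinate-wise median as a starting point
    for _ in range(n_iter):
        dist = np.linalg.norm(X - m, axis=1)
        w = 1.0 / np.maximum(dist, 1e-12)  # guard against division by zero
        m_new = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(m_new - m) < tol:
            return m_new
        m = m_new
    return m


def classical_pc1(X):
    """Leading principal direction after centering at the sample mean."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return vt[0]


def spherical_pc1(X):
    """Leading direction of the data projected onto the unit sphere."""
    R = X - spatial_median(X)
    norms = np.maximum(np.linalg.norm(R, axis=1), 1e-12)
    Z = R / norms[:, None]  # every observation now has unit length
    _, _, vt = np.linalg.svd(Z, full_matrices=False)
    return vt[0]


def angle_deg(u, v):
    """Angle in degrees between two directions, ignoring sign."""
    c = abs(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(c, 0.0, 1.0))))


def leading_direction_angles(d=1000, n=20, alpha=1.5, outlier_size=1e4, seed=0):
    """Angles to the true spike direction for classical and spherical PCA."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    X[:, 0] *= np.sqrt(d ** alpha)  # single spike along the first axis
    outlier = np.zeros(d)
    outlier[1] = outlier_size       # one contaminating point along the second axis
    X = np.vstack([X, outlier])
    e1 = np.zeros(d)
    e1[0] = 1.0                     # population eigenvector of the spike
    return angle_deg(classical_pc1(X), e1), angle_deg(spherical_pc1(X), e1)


if __name__ == "__main__":
    classical, spherical = leading_direction_angles()
    print(f"classical PCA angle to spike: {classical:.1f} degrees")
    print(f"spherical PCA angle to spike: {spherical:.1f} degrees")
```

In this setup the outlier contributes a very large variance along its own axis, so the classical leading direction is expected to turn toward it, while on the unit sphere the outlier counts as only one unit-length observation among n + 1. This is a finite-sample illustration under the stated assumptions and does not substitute for the asymptotic results.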

Notation
Spiked covariance model
Spherical PCA
Consistency and strong inconsistency
Underlying Gaussian model
Impact of outliers
Multi-spike model
Discussion
