Abstract

This is a study of principal component analysis performed on a statistical sample. We assume that this sample consists of independent copies of some random variable taking values in a separable real Hilbert space. This covers data in function spaces as well as data represented in reproducing kernel Hilbert spaces. Based on some new inequalities about the perturbation of nonnegative self-adjoint operators, we provide new bounds for the statistical fluctuations of the principal component representation across draws of the statistical sample. We suggest two kinds of improvements to decrease these fluctuations: the first is to use a robust estimate of the covariance operator, for which non-asymptotic bounds on the estimation error are available under weak polynomial moment assumptions. The second is to use a modification of the projection onto the principal components based on functional calculus applied to the covariance operator. Using this modified projection, we can obtain bounds that depend not on the spectral gap but on a more favorable factor. In the appendix, we provide a new approach to the analysis of the relative positions of two orthogonal projections that is useful for our proofs and is of independent interest.
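
For orientation, here is a minimal sketch of the standard objects behind these statements, in our own notation (the paper's exact definitions may differ). Given independent copies X1, …, Xn of a centered random variable X with values in a separable Hilbert space H, the covariance operator and its empirical counterpart are

    \Sigma = \mathbb{E}[X \otimes X], \qquad \widehat{\Sigma} = \frac{1}{n} \sum_{i=1}^{n} X_i \otimes X_i,

and d-dimensional PCA represents the data through the orthogonal projector \Pi_d onto the span of the d leading eigenvectors of \widehat{\Sigma}; when the d-th and (d+1)-th eigenvalues are distinct, this can be written via functional calculus as \Pi_d = \mathbf{1}_{[\lambda_d, +\infty)}(\widehat{\Sigma}). The modified projection mentioned above replaces this hard spectral cutoff by f(\widehat{\Sigma}) for a smoother (for instance Lipschitz) function f; this is one natural reading, and the paper's precise construction may differ.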

Introduction

Principal Component Analysis (PCA) is a classical tool for dimensionality reduction that relies on the spectral properties of the covariance matrix. Our goal is to improve existing fluctuation bounds by refining, on the one hand, the choice of the estimator Σ of the covariance matrix and, on the other hand, the choice of the representation itself, so as to make principal component analysis more robust to statistical fluctuations depending on the draw of the sample (X1, …, Xn). This is where our modified projection may help: it remains weakly dependent on the choice of the statistical sample (the precise meaning of this statement being given by a non-asymptotic bound), even when no large spectral gap is available.
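
As a concrete finite-dimensional illustration of the two representations at stake, the following Python sketch contrasts the usual hard projection onto the leading eigenvectors of an empirical covariance with a smoothed spectral cutoff obtained by functional calculus. The linear ramp in smooth_projection and all thresholds are our own illustrative choices, not the paper's construction.

```python
import numpy as np

def empirical_covariance(X):
    """Empirical covariance of the rows of X (n samples, p features), assuming centered data."""
    n = X.shape[0]
    return X.T @ X / n

def hard_projection(S, d):
    """Orthogonal projector onto the span of the d leading eigenvectors of S (classical PCA)."""
    _, eigvecs = np.linalg.eigh(S)   # eigh returns eigenvalues in ascending order
    U = eigvecs[:, -d:]              # the d leading eigenvectors
    return U @ U.T

def smooth_projection(S, lo, hi):
    """Functional calculus f(S): eigenvalues below lo are mapped to 0, above hi to 1,
    with a linear (hence Lipschitz) ramp in between -- an illustrative smoothing only."""
    eigvals, eigvecs = np.linalg.eigh(S)
    f = np.clip((eigvals - lo) / (hi - lo), 0.0, 1.0)
    return eigvecs @ np.diag(f) @ eigvecs.T

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10)) * np.linspace(3.0, 0.5, 10)  # decaying variance profile
S = empirical_covariance(X)
P_hard = hard_projection(S, d=3)
P_smooth = smooth_projection(S, lo=1.0, hi=2.0)
print(np.linalg.norm(P_hard - P_smooth, 2))  # spectral-norm gap between the two representations
```

When the gap between the d-th and (d+1)-th eigenvalues is small, a fresh draw of the sample can swap eigenvectors across the cutoff and change the hard projector abruptly, whereas the smoothed representation f(Σ) varies continuously with the estimator Σ; this is the intuition behind bounds that do not depend on the spectral gap.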
