Abstract
Many have suggested a bootstrap procedure for estimating the sampling variability of principal component analysis (PCA) results. However, when the number of measurements per subject (p) is much larger than the number of subjects (n), calculating and storing the leading principal components (PCs) from each bootstrap sample can be computationally infeasible. To address this, we outline methods for fast, exact calculation of bootstrap PCs, eigenvalues, and scores. Our methods leverage the fact that all bootstrap samples occupy the same n-dimensional subspace as the original sample. As a result, all bootstrap PCs are limited to the same n-dimensional subspace and can be efficiently represented by their low-dimensional coordinates in that subspace. Several uncertainty metrics can be computed solely based on the bootstrap distribution of these low-dimensional coordinates, without calculating or storing the p-dimensional bootstrap components. Fast bootstrap PCA is applied to a dataset of sleep electroencephalogram recordings (p = 900, n = 392), and to a dataset of brain magnetic resonance images (MRIs) (p ≈ 3 million, n = 352). For the MRI dataset, our method allows for standard errors for the first three PCs based on 1000 bootstrap samples to be calculated on a standard laptop in 47 min, as opposed to approximately 4 days with standard methods. Supplementary materials for this article are available online.
Highlights
Principal component analysis (PCA) (Jolliffe, 2005) is a dimension reduction technique that is widely used in fields such as genomics, survey analysis, and image analysis
The remaining four principal components (PCs) (PC2, PC3, PC4, and PC5) roughly correspond to different types of oscillatory patterns in the early hours of sleep. These components are fairly similar to the results of Di et al. (2009), who analyze a different subset of the data and employ a smooth multilevel functional PCA approach to estimate eigenfunctions that differentiate subjects from one another
We find that in approximately 4% of bootstrap samples from the magnetic resonance image (MRI) dataset, although a solution to the singular value decomposition (SVD) of DU′Pᵇ exists, the SVD function fails to converge
Summary
Principal component analysis (PCA) (Jolliffe, 2005) is a dimension reduction technique that is widely used in fields such as genomics, survey analysis, and image analysis. When applying the bootstrap to PCA in the high-dimensional setting, the challenge of calculating and storing the PCs from each bootstrap sample can make the procedure computationally infeasible. In order to create highly informative feature variables, PCA determines the set of orthonormal basis vectors such that the subjects' coordinates with respect to these new basis vectors are maximally variable (Jolliffe, 2005). These new basis vectors are called the sample principal components (PCs), and the subjects' coordinates with respect to these basis vectors are called the sample scores. Both the sample PCs and sample scores can be calculated via the singular value decomposition (SVD) of the sample data matrix. Recalculating the SVD for all B bootstrap samples has a computational complexity of order O(Bpn²), which can make the procedure computationally infeasible when p is very large
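The key computational trick can be sketched as follows. Writing the centered p × n data matrix as Y = V D U′, any bootstrap sample Yᵇ = Y Pᵇ (where Pᵇ resamples columns) stays in the column span of V, so the SVD of the small n × n matrix D U′ Pᵇ yields the low-dimensional coordinates of the bootstrap PCs. The NumPy sketch below is illustrative only: the dimensions, variable names, and choice of k are assumptions for the example, not the authors' implementation, and sign alignment of components across bootstrap samples is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, B, k = 2000, 50, 200, 3     # illustrative sizes; k = number of leading PCs

# Centered data matrix: p measurements (rows) by n subjects (columns)
Y = rng.standard_normal((p, n))
Y -= Y.mean(axis=1, keepdims=True)

# One expensive p-dimensional SVD of the original sample: Y = V @ diag(d) @ Ut
V, d, Ut = np.linalg.svd(Y, full_matrices=False)
DU = np.diag(d) @ Ut              # n x n; every bootstrap sample lies in span(V)

A_boot = np.empty((B, n, k))      # low-dimensional coordinates of bootstrap PCs
for b in range(B):
    idx = rng.integers(0, n, n)   # resample subjects (columns) with replacement
    # SVD of the small n x n matrix D U' P^b instead of the p x n matrix Y P^b
    A, s, Rt = np.linalg.svd(DU[:, idx], full_matrices=False)
    A_boot[b] = A[:, :k]

# Uncertainty metrics can be computed from A_boot alone; the p-dimensional
# bootstrap PCs, if ever needed, are recovered as V @ A_boot[b].
```

Each loop iteration costs O(n³) rather than O(pn²), which is the source of the speedup reported in the abstract when p is in the millions.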