Abstract
Many have suggested a bootstrap procedure for estimating the sampling variability of principal component analysis (PCA) results. However, when the number of measurements per subject (p) is much larger than the number of subjects (n), calculating and storing the leading principal components (PCs) from each bootstrap sample can be computationally infeasible. To address this, we outline methods for fast, exact calculation of bootstrap PCs, eigenvalues, and scores. Our methods leverage the fact that all bootstrap samples occupy the same n-dimensional subspace as the original sample. As a result, all bootstrap PCs are limited to the same n-dimensional subspace and can be efficiently represented by their low-dimensional coordinates in that subspace. Several uncertainty metrics can be computed solely based on the bootstrap distribution of these low-dimensional coordinates, without calculating or storing the p-dimensional bootstrap components. Fast bootstrap PCA is applied to a dataset of sleep electroencephalogram recordings (p = 900, n = 392), and to a dataset of brain magnetic resonance images (MRIs) (p ≈ 3 million, n = 352). For the MRI dataset, our method allows for standard errors for the first three PCs based on 1000 bootstrap samples to be calculated on a standard laptop in 47 min, as opposed to approximately 4 days with standard methods. Supplementary materials for this article are available online.
Highlights
Principal component analysis (PCA) (Jolliffe, 2005) is a dimension reduction technique that is widely used in fields such as genomics, survey analysis, and image analysis
The remaining four principal components (PCs) (PC2, PC3, PC4, and PC5) roughly correspond to different types of oscillatory patterns in the early hours of sleep. These components are fairly similar to the results of Di et al. (2009), who analyze a different subset of the data and employ a smooth multilevel functional PCA approach to estimate eigenfunctions that differentiate subjects from one another
We find that in approximately 4% of bootstrap samples from the magnetic resonance image (MRI) dataset, although a solution to the singular value decomposition (SVD) of DU′Pᵇ exists, the SVD function fails to converge
Summary
Principal component analysis (PCA) (Jolliffe, 2005) is a dimension reduction technique that is widely used in fields such as genomics, survey analysis, and image analysis. When applying the bootstrap to PCA in the high-dimensional setting, the challenge of calculating and storing the PCs from each bootstrap sample can make the procedure computationally infeasible. In order to create highly informative feature variables, PCA determines the set of orthonormal basis vectors such that the subjects' coordinates with respect to these new basis vectors are maximally variable (Jolliffe, 2005). These new basis vectors are called the sample principal components (PCs), and the subjects' coordinates with respect to these basis vectors are called the sample scores. Both the sample PCs and sample scores can be calculated via the singular value decomposition (SVD) of the sample data matrix. Recalculating the SVD for all B bootstrap samples has a computational complexity of order O(Bpn²), which can make the procedure computationally infeasible when p is very large
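The key computational trick can be sketched as follows. Writing the centered p × n data matrix as Y = V D U′, any bootstrap sample Yᵇ = Y Pᵇ (where Pᵇ resamples columns) stays in the column span of V, so the SVD of the small n × n matrix D U′ Pᵇ yields the low-dimensional coordinates of the bootstrap PCs. The NumPy sketch below is illustrative only: the dimensions, variable names, and choice of k are assumptions for the example, not the authors' implementation, and sign alignment of components across bootstrap samples is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, B, k = 2000, 50, 200, 3     # illustrative sizes; k = number of leading PCs

# Centered data matrix: p measurements (rows) by n subjects (columns)
Y = rng.standard_normal((p, n))
Y -= Y.mean(axis=1, keepdims=True)

# One expensive p-dimensional SVD of the original sample: Y = V @ diag(d) @ Ut
V, d, Ut = np.linalg.svd(Y, full_matrices=False)
DU = np.diag(d) @ Ut              # n x n; every bootstrap sample lies in span(V)

A_boot = np.empty((B, n, k))      # low-dimensional coordinates of bootstrap PCs
for b in range(B):
    idx = rng.integers(0, n, n)   # resample subjects (columns) with replacement
    # SVD of the small n x n matrix D U' P^b instead of the p x n matrix Y P^b
    A, s, Rt = np.linalg.svd(DU[:, idx], full_matrices=False)
    A_boot[b] = A[:, :k]

# Uncertainty metrics can be computed from A_boot alone; the p-dimensional
# bootstrap PCs, if ever needed, are recovered as V @ A_boot[b].
```

Each loop iteration costs O(n³) rather than O(pn²), which is the source of the speedup reported in the abstract when p is in the millions.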