CAN PRINCIPAL COMPONENT ANALYSIS PRESERVE THE SPARSITY IN FACTOR LOADINGS?

Abstract

This article studies the principal component analysis (PCA) estimation of weak factor models with sparse loadings. We uncover an intrinsic near-sparsity preservation property for the PCA estimators of loadings, which comes from the approximately (block) upper triangular structure of the rotation matrix. It implies an asymmetric relationship among factors: the sparsity of the rotated loadings for a stronger factor can be contaminated by the loadings of weaker ones, whereas the sparsity of the rotated loadings of a weaker factor is almost unaffected by the loadings of stronger ones. We then propose a simple alternative to existing penalized approaches: sparsifying the loading estimators by directly screening out the small PCA loading estimates, and we construct consistent estimators of the factor strengths. The proposed estimators perform well in finite samples, as shown by a set of Monte Carlo simulations.
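
The screening idea lends itself to a very small sketch. The following Python sketch is illustrative only: the threshold tau and the plug-in strength formula are placeholders, not the paper's rate-based choices.

```python
import numpy as np

def pca_loadings(X, r):
    """PC estimates of the loadings: scaled top-r right singular vectors
    of the T x N panel X (one common normalization; a sketch)."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return np.sqrt(X.shape[1]) * Vt[:r].T        # N x r loading estimates

def screen_loadings(L, tau):
    """Sparsify by zeroing loading estimates below the threshold tau
    (tau is a placeholder; the paper derives its own cutoff)."""
    L_screened = L.copy()
    L_screened[np.abs(L_screened) < tau] = 0.0
    return L_screened

def factor_strength(L_screened, k):
    """Plug-in strength estimate alpha_k = log(#nonzero loadings)/log(N),
    matching the convention that factor k loads on about N**alpha_k series."""
    N = L_screened.shape[0]
    s = max(np.count_nonzero(L_screened[:, k]), 1)
    return np.log(s) / np.log(N)
```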

Similar Papers
  • Research Article
  • Cited by 2
  • 10.1080/03610918.2021.1898638
Performances of some high dimensional regression methods: sparse principal component regression
  • Mar 14, 2021
  • Communications in Statistics - Simulation and Computation
  • Fatma Sevinç Kurnaz

Principal component analysis (PCA) is a widely used technique for data processing and dimensionality reduction, but it has the drawback that each principal component is a linear combination of all explanatory variables. As an alternative, sparse PCA (SPCA) is an appealing method that produces principal components with sparse loadings. Combining PCA on the explanatory variables with least squares regression yields principal component regression (PCR). In PCR, the components are obtained from the explanatory variables alone, without considering the effect of the dependent variable. Taking the dependent variable into account, sparse PCR (SPCR) obtains sparse principal component loadings, but its main drawback is computational cost. Exploiting the general structure of PCR, we combine (S)PCA with several sparse regression methods and compare the results with classical PCR and the recently introduced SPCR. Extensive simulation studies and real-data examples demonstrate their performance, and the results are supported by a computation time study.
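
As a reference point, classical PCR as described here, with components built from the explanatory variables alone, can be sketched in a few lines (a minimal illustration on synthetic data, not the authors' code):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))                       # explanatory variables
y = X[:, :3] @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200)

# Classical PCR: components come from X alone, ignoring y --
# exactly the drawback the abstract points out.
pcr = make_pipeline(PCA(n_components=5), LinearRegression())
pcr.fit(X, y)
print(pcr.score(X, y))                               # in-sample R^2
```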

  • Preprint Article
  • 10.13140/rg.2.2.23601.04965
Does Principal Component Analysis Preserve the Sparsity in Sparse Weak Factor Models?
  • May 10, 2023
  • arXiv (Cornell University)
  • Jie Wang + 1 more

This paper studies the principal component (PC) method-based estimation of weak factor models with sparse loadings. We uncover an intrinsic near-sparsity preservation property for the PC estimators of loadings, which comes from the approximately upper triangular (block) structure of the rotation matrix. It implies an asymmetric relationship among factors: the rotated loadings for a stronger factor can be contaminated by those from a weaker one, but the loadings for a weaker factor are almost free of the impact of those from a stronger one. More importantly, the finding implies that there is no need to use complicated penalties to sparsify the loading estimators. Instead, we adopt a simple screening method to recover the sparsity and construct estimators for various factor strengths. In addition, for sparse weak factor models, we provide a singular value thresholding-based approach to determine the number of factors and establish uniform convergence rates for PC estimators, which complement Bai and Ng (2023). The accuracy and efficiency of the proposed estimators are investigated via Monte Carlo simulations. The application to the FRED-QD dataset reveals the underlying factor strengths and loading sparsity as well as their dynamic features.
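
The singular value thresholding idea for choosing the number of factors can be sketched as follows; the default cutoff below is an illustrative placeholder, since the paper's actual threshold is not given in the abstract.

```python
import numpy as np

def num_factors_svt(X, threshold=None):
    """Count singular values of the scaled T x N panel that survive a
    threshold.  The default cutoff is a placeholder, not the paper's rule."""
    T, N = X.shape
    s = np.linalg.svd(X / np.sqrt(N * T), compute_uv=False)
    if threshold is None:
        threshold = 0.1 * s[0]                       # placeholder cutoff
    return int(np.sum(s > threshold))
```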

  • Book Chapter
  • Cited by 6
  • 10.1007/978-3-030-29859-3_52
Algorithm for Constructing a Classifier Team Using a Modified PCA (Principal Component Analysis) in the Task of Diagnosis of Acute Lymphocytic Leukaemia Type B-CLL
  • Jan 1, 2019
  • Mariusz Topolski + 1 more

Data recognition and classification systems are becoming increasingly sophisticated, with newer algorithms solving ever more difficult and complex decision problems. Very good results are obtained using ensembles of classifiers. In their research, the authors focused on data whose features can be grouped; clusters created in this manner can contribute to better recognition of certain decision classes. One such example is prognosis in B-CLL-type lymphocytic leukaemia. In this paper, the authors present a modified PCA-based feature selection method in which objects are rotated with respect to the decision classes. In addition to grouping similar features using Varimax rotation, a procedure for grouping patients within these PCA groups was developed. Within each PCA group, two classifiers, a strong and a weak one, were built. In the experimental part, the developed method was compared with one-stage recognition algorithms known from the literature. The results make a significant contribution to medical diagnostics: they support the development of a treatment procedure for B-CLL lymphocytic leukaemia, and an accurate diagnosis can increase a patient's chance of survival by enabling appropriate treatment.

  • Book Chapter
  • Cited by 7
  • 10.1007/978-3-319-03680-9_16
Sparse Principal Component Analysis via Joint L2,1-Norm Penalty
  • Jan 1, 2013
  • Shi Xiaoshuang + 5 more

Sparse principal component analysis (SPCA) is a popular method for obtaining sparse loadings in principal component analysis (PCA). It represents PCA as a regression model with a lasso constraint, but the features selected by SPCA are determined independently for, and generally differ across, the principal components (PCs). We therefore modify the regression model by replacing the elastic net with the L2,1-norm, which encourages row sparsity so that the same features are selected or discarded across all PCs, and we use this new self-contained regression model to present a new framework for graph embedding methods that obtains sparse loadings via the L2,1-norm. An experiment on the Pitprops data illustrates the row sparsity of this modified regression model for PCA, and an experiment on the YaleB face database demonstrates the effectiveness of the model for PCA in graph embedding.
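
For readers unfamiliar with the penalty, the L2,1-norm sums the Euclidean norms of the rows of the loadings matrix, and its proximal operator shrinks entire rows to zero, which is what produces row sparsity. A minimal sketch of that generic group soft-thresholding step (not the authors' full algorithm):

```python
import numpy as np

def l21_norm(B):
    """L2,1 norm: the sum of the Euclidean norms of the rows of B."""
    return float(np.sum(np.linalg.norm(B, axis=1)))

def row_soft_threshold(B, lam):
    """Proximal operator of lam * ||B||_{2,1}: shrink each row's norm by
    lam, sending weak rows exactly to zero (row sparsity), so the same
    features are kept or dropped in every PC."""
    norms = np.linalg.norm(B, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - lam / np.maximum(norms, 1e-12))
    return B * scale
```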

  • Research Article
  • Cited by 27
  • 10.1016/j.chemolab.2015.06.014
Adaptive sparse principal component analysis for enhanced process monitoring and fault isolation
  • Jun 26, 2015
  • Chemometrics and Intelligent Laboratory Systems
  • Kangling Liu + 4 more


  • Research Article
  • Cited by 10
  • 10.1016/j.patrec.2012.06.010
Feature selection from high-order tensorial data via sparse decomposition
  • Jun 21, 2012
  • Pattern Recognition Letters
  • Donghui Wang + 1 more


  • Research Article
  • Cited by 3003
  • 10.1198/106186006x113430
Sparse Principal Component Analysis
  • Jun 1, 2006
  • Journal of Computational and Graphical Statistics
  • Hui Zou + 2 more

Principal component analysis (PCA) is widely used in data processing and dimensionality reduction. However, PCA suffers from the fact that each principal component is a linear combination of all the original variables, thus it is often difficult to interpret the results. We introduce a new method called sparse principal component analysis (SPCA) using the lasso (elastic net) to produce modified principal components with sparse loadings. We first show that PCA can be formulated as a regression-type optimization problem; sparse loadings are then obtained by imposing the lasso (elastic net) constraint on the regression coefficients. Efficient algorithms are proposed to fit our SPCA models for both regular multivariate data and gene expression arrays. We also give a new formula to compute the total variance of modified principal components. As illustrations, SPCA is applied to real and simulated data with encouraging results.
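
scikit-learn ships a closely related sparse PCA (an l1-penalized matrix-factorization variant rather than Zou et al.'s exact elastic-net algorithm), which is enough to see the contrast with dense loadings:

```python
import numpy as np
from sklearn.decomposition import PCA, SparsePCA

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 20))

dense = PCA(n_components=3).fit(X)
sparse = SparsePCA(n_components=3, alpha=1.0, random_state=1).fit(X)

# Dense loadings involve every variable; the sparse ones zero most out.
print(np.count_nonzero(dense.components_))           # 60 (3 x 20)
print(np.count_nonzero(sparse.components_))          # far fewer
```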

  • Research Article
  • Cited by 2
  • 10.1007/s00362-018-1045-6
Sparse common component analysis for multiple high-dimensional datasets via noncentered principal component analysis
  • Sep 22, 2018
  • Statistical Papers
  • Heewon Park + 1 more

There is currently much discussion about the analysis of multiple datasets from different groups, and identifying a common basic structure across groups has drawn particular attention. To identify such a structure, common component analysis (CCA) was proposed by generalizing principal component analysis (PCA); CCA reduces to standard PCA when applied to a single dataset. Although CCA can identify the common structure of multiple datasets, which cannot be extracted by standard PCA, it suffers from the following drawbacks: the common components are estimated as linear combinations of all variables, making them difficult to interpret, and the fully dense loadings lead to erroneous results because noisy features are inevitably included in the datasets. To address these issues, we incorporate sparsity into CCA and propose a novel strategy for sparse common component analysis based on L1-type regularized regression modeling. We focus on CCA formulated as the eigenvalue decomposition (EVD) of a Gram matrix (i.e., the common loadings of multiple datasets can be estimated by EVD of a Gram matrix), which can be performed via singular value decomposition of a square root of the Gram matrix. We then propose sparse common component analysis based on sparse PCA, along with an algorithm, to estimate sparse common loadings of multiple datasets. The proposed method can not only identify a common subspace but also select crucial common features for multiple groups. Monte Carlo simulations and real-data analyses examine the efficiency of the proposed sparse CCA. The numerical studies show that our strategy incorporates sparsity into the common loading estimation and efficiently recovers a sparse common structure in multiple-dataset analysis.
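
One plausible reading of the pipeline, sketched below with hypothetical names: pool the noncentered Gram matrices of the groups, take the top eigenvectors as common loadings, and sparsify them. The paper itself estimates the loadings via L1-regularized regression on a square root of the Gram matrix, so this is only an approximation of the idea:

```python
import numpy as np

def sparse_common_loadings(datasets, r, lam):
    """Pool the noncentered Gram matrices of all groups, take the top-r
    eigenvectors as common loadings, then soft-threshold them.  A rough
    sketch only, not the paper's regularized-regression estimator."""
    G = sum(X.T @ X for X in datasets)               # pooled p x p Gram matrix
    _, evecs = np.linalg.eigh(G)
    V = evecs[:, ::-1][:, :r]                        # top-r common loadings
    return np.sign(V) * np.maximum(np.abs(V) - lam, 0.0)
```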

  • Research Article
  • Cited by 335
  • 10.1214/08-aos618
Finite sample approximation results for principal component analysis: A matrix perturbation approach
  • Dec 1, 2008
  • The Annals of Statistics
  • Boaz Nadler

Principal component analysis (PCA) is a standard tool for dimensional reduction of a set of n observations (samples), each with p variables. In this paper, using a matrix perturbation approach, we study the nonasymptotic relation between the eigenvalues and eigenvectors of PCA computed on a finite sample of size n, and those of the limiting population PCA as n→∞. As in machine learning, we present a finite sample theorem which holds with high probability for the closeness between the leading eigenvalue and eigenvector of sample PCA and population PCA under a spiked covariance model. In addition, we also consider the relation between finite sample PCA and the asymptotic results in the joint limit p, n→∞, with p/n=c. We present a matrix perturbation view of the “phase transition phenomenon,” and a simple linear-algebra based derivation of the eigenvalue and eigenvector overlap in this asymptotic limit. Moreover, our analysis also applies for finite p, n where we show that although there is no sharp phase transition as in the infinite case, either as a function of noise level or as a function of sample size n, the eigenvector of sample PCA may exhibit a sharp “loss of tracking,” suddenly losing its relation to the (true) eigenvector of the population PCA matrix. This occurs due to a crossover between the eigenvalue due to the signal and the largest eigenvalue due to noise, whose eigenvector points in a random direction.
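
The "loss of tracking" is easy to reproduce numerically. In the sketch below (assuming the standard spiked model with population covariance diag(ell, 1, ..., 1), aspect ratio gamma = p/n, and a transition near 1 + sqrt(gamma)), the overlap between the top sample and population eigenvectors collapses once the spike falls below the threshold:

```python
import numpy as np

rng = np.random.default_rng(2)
p, n = 200, 400
gamma = p / n                         # aspect ratio p/n; transition near 1 + sqrt(gamma)

for ell in [1.2, 2.0, 4.0]:           # spike eigenvalue below / near / above threshold
    X = rng.normal(size=(n, p))
    X[:, 0] *= np.sqrt(ell)           # spiked covariance: diag(ell, 1, ..., 1)
    S = X.T @ X / n                   # sample covariance
    _, V = np.linalg.eigh(S)
    overlap = abs(V[0, -1])           # |<top sample eigenvector, e_1>|
    print(f"ell={ell:.1f}  threshold={1 + np.sqrt(gamma):.2f}  overlap={overlap:.2f}")
```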

  • Conference Article
  • Cited by 2
  • 10.1109/waina.2015.135
Broadcast Protocols in Wireless Networks
  • Mar 1, 2015
  • Kazuaki Kouno + 3 more

In distributed applications, groups of nodes cooperate with one another to achieve common objectives. This paper discusses how to broadcast messages to every node of a group interconnected by a wireless network. To reduce the number of messages relative to flooding protocols, only relay nodes forward messages, as in the MPR (Multi-Point Relay) protocol, where in one round the nodes covered by a root node are its first- and second-neighbor nodes. By iterating rounds, a spanning tree covering every node is obtained, with nodes selected in root-to-leaf order. This paper newly proposes the N2N3, N3N3, and MN3N3 broadcast protocols, in which third-neighbor nodes, in addition to first- and second-neighbor nodes, are covered in one round. Two types of relay nodes are considered: strong nodes, which broadcast messages at the strongest radio wave intensity, and weak nodes, which unicast messages at a weaker intensity. A root node is strong, its first-neighbor nodes are weak, and its second-neighbor nodes are strong, so nodes forward messages alternately with strong and weak radio wave intensity. In the proposed protocols, nodes are selected in leaf-to-root order. In one round of the N2N3 protocol, relay nodes are first selected among second-neighbor nodes and then among first-neighbor nodes. In the N3N3 protocol, third-neighbor nodes are selected as relay nodes first, and then second- and first-neighbor nodes are selected as weak and strong relay nodes, respectively. In the N2N3 protocol, not every third-neighbor node may be covered by the root node in each round, whereas every third-neighbor node is covered in the N3N3 protocol.

  • Conference Article
  • Cited by 7
  • 10.1109/cisp.2009.5301566
Detection of Chirp Signal by Combination of Kurtosis Detection and Filtering in Fractional Fourier Domain
  • Oct 1, 2009
  • Qin Yali + 3 more

Chirp signals are widely used in radar. As a signal processing technique, the fractional Fourier transform (FRFT) can concentrate the energy of a chirp signal, so it is a potentially effective tool for chirp detection. Compared with the common Wigner-Ville distribution (WVD), the FRFT is a linear operator and is not affected by cross-terms even when multiple components exist. Moreover, to solve the problem of weak signal components being shadowed by the sidelobes of strong ones, a new method combining kurtosis detection with filtering in the fractional Fourier domain is proposed, so that strong and weak components can be detected iteratively. The effectiveness of this combined method is demonstrated through simulation.

  • Research Article
  • Cited by 3
  • 10.1016/j.jeconom.2023.02.002
Shrinkage estimation of multiple threshold factor models
  • Feb 21, 2023
  • Journal of Econometrics
  • Chenchen Ma + 1 more


  • Peer Review Report
  • 10.7554/elife.80063.sa2
Author response: Sparse dimensionality reduction approaches in Mendelian randomisation with highly correlated exposures
  • Nov 28, 2022
  • Vasileios Karageorgiou + 3 more

Multivariable Mendelian randomisation (MVMR) is an instrumental variable technique that generalises the MR framework for multiple exposures. Framed as a regression problem, it is subject to the pitfall of multicollinearity, so the bias and efficiency of MVMR estimates depend heavily on the correlation of exposures. Dimensionality reduction techniques such as principal component analysis (PCA) provide transformations of all the included variables that are effectively uncorrelated. We propose the use of sparse PCA (sPCA) algorithms that create principal components from subsets of the exposures, with the aim of providing more interpretable and reliable MR estimates. The approach consists of three steps. We first apply a sparse dimension reduction method and transform the variant-exposure summary statistics to principal components. We then choose a subset of the principal components based on data-driven cutoffs and estimate their strength as instruments with an adjusted F-statistic. Finally, we perform MR with these transformed exposures. This pipeline is demonstrated in a simulation study of highly correlated exposures and an applied example using summary data from a genome-wide association study of 97 highly correlated lipid metabolites. As a positive control, we tested the causal associations of the transformed exposures on coronary heart disease (CHD). Compared to the conventional inverse-variance weighted MVMR method and a weak instrument robust MVMR method (MR GRAPPLE), sparse component analysis achieved a superior balance of sparsity and biologically insightful grouping of the lipid traits.
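
A compressed sketch of the three-step pipeline, using plain PCA in place of the sparse variants for brevity (function and variable names here are illustrative, not the authors' code):

```python
import numpy as np

def pc_mvmr(beta_X, beta_Y, se_Y, k):
    """Three-step sketch: (1) reduce the SNP-exposure matrix beta_X
    (n_snps x n_exposures) to k principal components, (2) use the PC
    scores as transformed exposures, (3) fit inverse-variance-weighted
    MVMR of the untransformed SNP-outcome betas on the scores."""
    centred = beta_X - beta_X.mean(axis=0)
    _, _, Vt = np.linalg.svd(centred, full_matrices=False)
    scores = beta_X @ Vt[:k].T                       # SNP-level PC exposures
    W = np.diag(1.0 / se_Y**2)                       # IVW weights
    coef = np.linalg.solve(scores.T @ W @ scores, scores.T @ W @ beta_Y)
    return coef                                      # one causal estimate per PC
```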

  • Conference Article
  • Cited by 7
  • 10.1109/cspa.2016.7515857
Analysis of sparse PCA using high dimensional data
  • Mar 1, 2016
  • Fatin Raihana On + 3 more

In this study, sparse principal component analysis (sparse PCA) was chosen for feature extraction and compared with the conventional PCA technique on six high-dimensional datasets from the UCI Machine Learning Repository. The results show that both PCA and sparse PCA are suitable for feature extraction from high-dimensional data, since the accuracy rates attained are higher than when the original data are used as classifier inputs. However, PCA suffers from inconsistency in determining the number of PCs to be retained, despite its higher accuracy rate. Meanwhile, sparse PCA retains the original number of principal components (PCs) with loadings that are mostly zero, but does not produce promising results on all the datasets; it needs to be applied to suitable high-dimensional datasets to reach its full accuracy and efficiency.

  • Conference Article
  • Cited by 35
  • 10.1117/12.651658
Sparse principal component analysis in medical shape modeling
  • Mar 2, 2006
  • Karl Sjöstrand + 2 more

Principal component analysis (PCA) is a widely used tool in medical image analysis for data reduction, model building, and data understanding and exploration. While PCA is a holistic approach where each new variable is a linear combination of all original variables, sparse PCA (SPCA) aims at producing easily interpreted models through sparse loadings, i.e. each new variable is a linear combination of a subset of the original variables. One of the aims of using SPCA is the possible separation of the results into isolated and easily identifiable effects. This article introduces SPCA for shape analysis in medicine. Results for three different data sets are given in relation to standard PCA and sparse PCA by simple thresholding of small loadings. Focus is on a recent algorithm for computing sparse principal components, but a review of other approaches is supplied as well. The SPCA algorithm has been implemented using Matlab and is available for download. The general behavior of the algorithm is investigated, and strengths and weaknesses are discussed. The original report on the SPCA algorithm argues that the ordering of modes is not an issue. We disagree on this point and propose several approaches to establish sensible orderings. A method that orders modes by decreasing variance and maximizes the sum of variances for all modes is presented and investigated in detail.
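
The proposed reordering of modes can be sketched as follows: a simplified version that only sorts modes by the variance of the projected data (the article's variant additionally maximizes the sum of variances across all modes):

```python
import numpy as np

def order_modes_by_variance(X, B):
    """Order sparse PCA modes by decreasing variance of the data projected
    onto each mode (names here are illustrative).
    X: (n_samples, n_vars) data; B: (n_vars, n_modes) sparse loadings."""
    scores = (X - X.mean(axis=0)) @ B
    var = scores.var(axis=0)
    order = np.argsort(var)[::-1]                    # decreasing variance
    return B[:, order], var[order]
```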
