Sparse outlier-robust PCA for multi-source data
Abstract: Sparse and outlier-robust principal component analysis (PCA) has been a very active field of research recently. Yet, most existing methods apply PCA to a single data set whereas multi-source data—i.e. multiple related data sets requiring joint analysis—arise across many scientific areas. We introduce a novel PCA methodology that simultaneously (i) selects important features, (ii) allows for the detection of global sparse patterns across multiple data sources as well as local source-specific patterns, and (iii) is resistant to outliers. To this end, we develop a regularization problem with a penalty that accommodates global-local structured sparsity patterns, and where an outlier-robust covariance estimator, namely the ssMRCD, is used as plug-in to permit joint, robust analysis across multiple data sources. We provide an efficient implementation of our proposal via the alternating direction method of multipliers and illustrate its practical advantages in simulations and in applications.
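The abstract's ADMM solver and ssMRCD plug-in are the authors' own contributions; as a rough illustration of the plug-in idea only, the sketch below extracts one sparse loading from an arbitrary covariance estimate via truncated power iteration with soft-thresholding (a common l1 proxy). The toy covariance and threshold `lam` are assumptions for demonstration.

```python
import numpy as np

def sparse_pc(S, lam, n_iter=200, tol=1e-8):
    """First sparse loading of a covariance estimate S: alternate a power
    step with soft-thresholding. A generic sketch, not the paper's method."""
    p = S.shape[0]
    v = np.ones(p) / np.sqrt(p)                          # flat start vector
    for _ in range(n_iter):
        w = S @ v                                        # power step
        w = np.sign(w) * np.maximum(np.abs(w) - lam, 0)  # soft-threshold
        nw = np.linalg.norm(w)
        if nw < 1e-12:                                   # penalty too strong
            return np.zeros(p)
        w /= nw
        if np.linalg.norm(w - v) < tol:
            return w
        v = w
    return v

# Toy covariance whose leading eigenvector lives on the first two coordinates.
S = np.diag([5.0, 4.0, 0.1, 0.1])
S[0, 1] = S[1, 0] = 1.5
v = sparse_pc(S, lam=0.3)
print(np.nonzero(np.abs(v) > 1e-6)[0])  # → [0 1]
```

Any robust covariance estimator (such as the ssMRCD the paper plugs in) could be passed as `S` in place of the toy matrix.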
- Peer Review Report
- 10.7554/elife.80063.sa2
- Nov 28, 2022
Author response: Sparse dimensionality reduction approaches in Mendelian randomisation with highly correlated exposures
- Research Article
2
- 10.1007/s00362-018-1045-6
- Sep 22, 2018
- Statistical Papers
There is currently much discussion about the analysis of multiple datasets from different groups, in which identifying a common basic structure shared across groups has drawn a large amount of attention. In order to identify a common basic structure, common component analysis (CCA) was proposed by generalizing techniques for principal component analysis (PCA); i.e., CCA reduces to standard PCA when applied to only one dataset. Although CCA can identify the common structure of multiple datasets, which cannot be extracted by standard PCA, it suffers from the following drawbacks. The common components are estimated as linear combinations of all variables, and thus it is difficult to interpret the identified common components. The fully dense loadings lead to erroneous results in CCA, because noisy features are inevitably included in datasets. To address these issues, we incorporate sparsity into CCA and propose a novel strategy for sparse common component analysis based on $L_1$-type regularized regression modeling. We focus on CCA formulated as the eigenvalue decomposition (EVD) of a Gram matrix (i.e., common loadings of multiple datasets can be estimated by EVD of a Gram matrix), which can be performed via the singular value decomposition of a square root of the Gram matrix. We then propose sparse common component analysis based on sparse PCA, together with an algorithm to estimate sparse common loadings of multiple datasets. The proposed method can not only identify a common subspace but also select crucial common features for multiple groups. Monte Carlo simulations and real-data analysis are conducted to examine the efficiency of the proposed sparse CCA. We observe from the numerical studies that our strategies can incorporate sparsity into the common loading estimation and efficiently recover a sparse common structure in multiple-dataset analysis.
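A bare-bones version of the EVD route this abstract describes might look as follows: pool the per-dataset Gram (cross-product) matrices, take the leading eigenvector as the common loading, and soft-threshold it for sparsity. The simulated datasets, the threshold `lam`, and the rank-1 structure are illustrative assumptions, not the paper's estimator.

```python
import numpy as np

def sparse_common_loading(Xs, lam):
    """Leading common loading across datasets: eigendecompose the pooled
    Gram matrix, then soft-threshold and renormalize. A plain sketch of the
    EVD route; the paper's regularized-regression formulation is more refined."""
    G = sum(X.T @ X / X.shape[0] for X in Xs)            # pooled Gram matrix
    _, vecs = np.linalg.eigh(G)
    v = vecs[:, -1]                                      # leading eigenvector
    v = np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)    # sparsify loading
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

rng = np.random.default_rng(0)
# Two datasets sharing one common direction supported on the first 2 variables.
u = np.array([1.0, 1.0, 0.0, 0.0]) / np.sqrt(2)
Xs = [rng.standard_normal((100, 1)) * 3.0 @ u[None, :]
      + 0.1 * rng.standard_normal((100, 4)) for _ in range(2)]
v = sparse_common_loading(Xs, lam=0.2)
print(np.nonzero(np.abs(v) > 1e-6)[0])
```

The thresholding zeroes out the two pure-noise variables, so the recovered common loading is supported on the first two coordinates only.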
- Research Article
7
- 10.1057/s41260-022-00264-2
- Apr 24, 2022
- Journal of Asset Management
In this paper, we investigate characteristic differences between Socially Responsible Investment (SRI) funds and conventional funds across 35 different categories, including previously unexplored areas, such as fund manager skills and investment strategies. Further, we examine SRI and conventional funds globally rather than from just one country (e.g., US) or one region (e.g., Europe), covering funds listed in 22 different countries. We also adopt a new Principal Component Analysis (PCA) methodology for matching SRI funds against their conventional counterparts that significantly increases the sample size from previous studies, reducing selection bias and possibly explaining contradictory findings in the prior literature. Contributing to the literature, our findings show that: (i) SRI funds have more diversified portfolios than conventional funds; (ii) SRI funds have lower cash holdings while investing more in US equities; and (iii) SRI fund managers charge a smaller fee and are more successful in managing their portfolios. This is reassuring for investors who invest in SRI funds and for the future health and sustainability of the planet.
- Research Article
9
- 10.1016/j.psep.2023.04.036
- Apr 21, 2023
- Process Safety and Environmental Protection
An enhanced temporal algorithm-coupled optimized adaptive sparse principal component analysis methodology for fault diagnosis of chemical processes
- Research Article
44
- 10.1214/18-aoas1146
- Dec 1, 2018
- The Annals of Applied Statistics
Many applications involve large datasets with entries from exponential family distributions. Our main motivating application is photon-limited imaging, where we observe images with Poisson distributed pixels. We focus on X-ray Free Electron Lasers (XFEL), a quickly developing technology whose goal is to reconstruct molecular structure. In XFEL, estimating the principal components of the noiseless distribution is needed for denoising and for structure determination. However, the standard method, Principal Component Analysis (PCA), can be inefficient in non-Gaussian noise. Motivated by this application, we develop $e$PCA (exponential family PCA), a new methodology for PCA on exponential families. $e$PCA is a fast method that can be used very generally for dimension reduction and denoising of large data matrices with exponential family entries. We conduct a substantive XFEL data analysis using $e$PCA. We show that $e$PCA estimates the PCs of the distribution of images more accurately than PCA and alternatives. Importantly, it also leads to better denoising. We also provide theoretical justification for our estimator, including the convergence rate and the Marchenko–Pastur law in high dimensions. An open-source implementation is available.
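The diagonal-debiasing idea at the core of ePCA can be sketched in a few lines: for Poisson data the noise variance equals the mean, so subtracting `diag(mean)` from the sample covariance leaves an estimate of the covariance of the clean intensities. The published method adds homogenization and shrinkage on top; the simulation below is an assumed toy.

```python
import numpy as np

def poisson_debiased_cov(Y):
    """Covariance of the clean means under Poisson noise: a Poisson variable's
    variance equals its mean, so subtracting diag(mean) from the sample
    covariance removes the noise contribution (first step of ePCA only)."""
    S = np.cov(Y, rowvar=False)            # noise-inflated sample covariance
    return S - np.diag(Y.mean(axis=0))     # strip the Poisson noise variance

rng = np.random.default_rng(1)
n, p = 20000, 5
# Clean intensities: constant baseline plus a shared uniform fluctuation t,
# so the true clean covariance is Var(t) = 1/3 in every entry.
t = rng.uniform(0.0, 2.0, size=n)
X = 5.0 + np.outer(t, np.ones(p))
Y = rng.poisson(X)                         # photon-limited observations
D = poisson_debiased_cov(Y)
print(D[0, 0], np.cov(Y, rowvar=False)[0, 0])
```

The raw sample variance comes out near 6.3 (clean variance plus mean photon count ≈ 6), while the debiased diagonal is close to the true 1/3.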
- Research Article
- 10.1118/1.4734628
- Jun 1, 2012
- Medical Physics
Purpose: The purpose of this study is to use an iterative principal component analysis (PCA) methodology to enhance lung tumor motion relative to stationary surrounding anatomy, in order to improve tumor localization and allow efficient lung tumor tracking. Methods: A digital thorax phantom containing several ellipses of different sizes and densities was used to simulate the tumor, lungs and vertebral column. The CBCT acquisition numerically generated 700 projections over 360° in 2 min. The projections were simulated using line integrals of phantom density with parallel-beam geometry. The rigid-anatomy perspective change remains minimal within a 5–6° arc while a typical tumor moves through approximately half a breathing cycle. A set of 10–12 projection images is generated within a 5–6° arc and used for PCA analysis. PCA transformed the axes of the image set to extract uncorrelated dominant features. For a given set of 10–12 projections, the middle image was filtered using PCA, retaining only the few principal components that satisfy a user-defined cut-off threshold. The method was applied to all 700 projections using a moving angular window. The eigenvector selection criterion allowed a different number of components to reconstruct the PCA-filtered images depending on the variation within the data set. The proposed methodology was evaluated using simple sin and sin^6 motion profiles as well as a complex patient motion profile. Results: PCA coefficient cut-off values of 10% and 20% recovered the amplitude, period and phase of the phantom motion within 5% error. These cut-off values also enhanced lung tumor visibility in the PCA-filtered images. The methodology was also applied to prerecorded patient CBCT projections and evaluated at the 10% and 20% cut-off values; the 20% cut-off provided superior contrast. Conclusion: Iterative principal component analysis is a robust method to emphasize variation among CBCT projection images when the rigid anatomy remains relatively stationary. The proposed methodology has shown promising results on patient data.
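A stripped-down version of the per-window filtering step this abstract describes might look like the following. The 8×8 phantom, the sinusoidal motion, and the variance-share cut-off are stand-ins for the paper's simulated projections, not its actual data.

```python
import numpy as np

def pca_filter_middle(window, var_cutoff=0.2):
    """Filter the middle image of a projection window: keep only principal
    components whose variance share meets the cut-off, mirroring the
    abstract's 10%/20% thresholds. Images are flattened rows; a sketch."""
    W = np.asarray([im.ravel() for im in window], dtype=float)
    mean = W.mean(axis=0)                     # stationary-background estimate
    U, s, Vt = np.linalg.svd(W - mean, full_matrices=False)
    share = s**2 / np.sum(s**2)               # variance explained per component
    keep = share >= var_cutoff                # user-defined cut-off
    mid = len(window) // 2
    coords = U[mid] * s                       # PC coordinates of middle image
    filtered = mean + (coords * keep) @ Vt    # rebuild from retained PCs only
    return filtered.reshape(window[0].shape)

# Toy window: stationary anatomy plus one breathing-like oscillation, so the
# motion concentrates in a single dominant principal component.
imgs = []
for k in range(10):
    im = np.zeros((8, 8))
    im[2:6, 2:6] = 1.0                                   # stationary structure
    im[7, 4] = 3.0 + 2.0 * np.sin(2 * np.pi * k / 10)    # moving feature
    imgs.append(im)
out = pca_filter_middle(imgs, var_cutoff=0.2)
print(out.shape)  # → (8, 8)
```

Because the toy motion is rank-1, the single retained component reproduces the middle frame exactly; with noisier windows the cut-off trades motion fidelity against noise suppression, as in the paper's 10% vs 20% comparison.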
- Book Chapter
34
- 10.1007/978-3-319-18032-8_38
- Jan 1, 2015
With advances in data collection technologies, multiple data sources are assuming increasing prominence in many applications. Clustering from multiple data sources has emerged as a topic of critical significance in the data mining and machine learning community. Different data sources provide different levels of detail and complementary knowledge. Thus, combining multiple data sources is pivotal to facilitating the clustering process. However, in reality, the data usually exhibit heterogeneity and incompleteness. The key challenge is how to effectively integrate information from multiple heterogeneous sources in the presence of missing data. Conventional methods mainly focus on clustering heterogeneous data with full information in all sources, or with at least one source free of missing values. In this paper, we propose a more general framework, T-MIC (Tensor based Multi-source Incomplete data Clustering), to integrate multiple incomplete data sources. Specifically, we first use kernel matrices to form an initial tensor across all the sources. We then formulate a joint tensor factorization process with a sparsity constraint and use it to iteratively push the initial tensor towards a quality-driven exploration of the latent factors, taking missing-data uncertainty into account. Finally, these factors serve as features for clustering. Extensive experiments on both synthetic and real datasets demonstrate that our proposed approach can effectively boost clustering performance, even with large amounts of missing data.
- Conference Article
23
- 10.1145/1401890.1401990
- Aug 24, 2008
Selection of genes that are differentially expressed and critical to a particular biological process has been a major challenge in post-array analysis. Recent development in bioinformatics has made various data sources available such as mRNA and miRNA expression profiles, biological pathway and gene annotation, etc. Efficient and effective integration of multiple data sources helps enrich our knowledge about the involved samples and genes for selecting genes bearing significant biological relevance. In this work, we studied a novel problem of multi-source gene selection: given multiple heterogeneous data sources (or data sets), select genes from expression profiles by integrating information from various data sources. We investigated how to effectively employ information contained in multiple data sources to extract an intrinsic global geometric pattern and use it in covariance analysis for gene selection. We designed and conducted experiments to systematically compare the proposed approach with representative methods in terms of statistical and biological significance, and showed the efficacy and potential of the proposed approach with promising findings.
- Research Article
4
- 10.1360/cjcp2006.19(2).143.6
- Aug 1, 2006
- Chinese Journal of Chemical Physics
Density functional theory (DFT) was used to calculate molecular descriptors (properties) for 12 fluoroquinolones with anti-S. pneumoniae activity. Principal component analysis (PCA) and hierarchical cluster analysis (HCA) were employed to reduce dimensionality and investigate which variables are most effective for classifying fluoroquinolones according to their degree of anti-S. pneumoniae activity. The PCA results showed that the variables ELUMO, Q3, Q5, QA, logP, MR, VOL and EHL of these compounds were responsible for the anti-S. pneumoniae activity. The HCA results were similar to those obtained with PCA. The methodologies of PCA and HCA provide a reliable rule for classifying new fluoroquinolones with anti-S. pneumoniae activity. Using the chemometric results, 6 synthetic compounds were analyzed through PCA and HCA, and two of them are proposed as active molecules against S. pneumoniae, which is consistent with the results of clinical experiments.
- Research Article
4
- 10.1007/s11458-009-0102-z
- Jan 9, 2010
- Frontiers of Chemistry in China
Density functional theory (DFT) was used to calculate a set of molecular descriptors (properties) for 14 fluoroquinolones with anti-Pseudomonas aeruginosa activity. Principal component analysis (PCA) and hierarchical cluster analysis (HCA) were employed in order to reduce dimensionality and investigate the effectiveness of variables, i.e., which subset of variables should be most effective for classifying fluoroquinolones according to their antibacterial activities against P. aeruginosa. The PCA results showed that the variables ELUMO, ΔEHL, Q5, Q6, logP, MR, and MP are responsible for the separation between compounds with higher and lower anti-P. aeruginosa activity. The HCA results were similar to those obtained using PCA. Using the chemometric results, four synthetic compounds were analyzed through PCA and HCA; two of them are proposed as active molecules against P. aeruginosa, consistent with the observations of clinical experiments. The methodologies of PCA and HCA provide a reliable rule for classifying new fluoroquinolones with anti-P. aeruginosa activity.
- Research Article
27
- 10.1109/tase.2022.3144288
- Jan 1, 2023
- IEEE Transactions on Automation Science and Engineering
This paper introduces a novel sparse dynamic inner principal component analysis (SDiPCA)-based monitoring scheme for multimode dynamic processes. Unlike traditional multimode monitoring algorithms, the model is updated for sequential modes by memorizing the significant features of existing modes. Adopting the concept of intelligent synapses from continual learning, a quadratic loss term is introduced to penalize changes in mode-relevant parameters, where modified synaptic intelligence (MSI) is proposed to estimate parameter importance. The proposed algorithm is therefore referred to as SDiPCA-MSI. When a new mode arrives, a set of normal samples is collected. Previously learned significant features are consolidated without explicitly storing training samples, while new information is extracted from the current mode. Consequently, SDiPCA-MSI can provide outstanding performance for successive modes. Characteristics of the proposed approach are discussed, including its computational complexity, advantages and potential limitations. Compared with several state-of-the-art monitoring methods, the effectiveness and superiority of the proposed method are demonstrated on a continuous stirred tank heater case and a practical industrial system. Note to Practitioners: Multimode process monitoring is increasingly significant as industrial systems generally operate under varying operating conditions. However, most research focuses on multiple local monitoring models for complex multimode processes and assumes that data for all possible modes are available and stored before learning. When similar or new modes arrive, local models are rebuilt for each mode, and the model's capacity grows with the continuous emergence of modes. Adaptive methods are a branch of multimode monitoring algorithms, but they strive to extract information from the current mode to ensure monitoring performance while gradually forgetting previously learned knowledge.
This paper proposes a novel sparse dynamic inner principal component analysis with continual-learning ability for multimode dynamic process monitoring, where modified synaptic intelligence is developed to measure parameter importance accurately. It requires limited computation and storage resources for successive modes, which is convenient for practical applications. As with current multimode process monitoring algorithms, a set of data must be collected before learning a new mode, which may complicate real-time monitoring. For industrial systems such as large-scale power plants and chemical systems, the proposed method has outstanding ability to monitor successive dynamic modes.
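The consolidation mechanism the abstract alludes to can be written down generically: a quadratic surcharge on drift from previous-mode parameters, weighted by per-parameter importance. The weights and values below are made up for illustration; the paper's MSI estimate of `omega` is its own contribution.

```python
import numpy as np

def si_penalty(theta, theta_prev, omega, c=1.0):
    """Synaptic-intelligence-style consolidation: a quadratic term that grows
    when important parameters (large omega) drift from their values on
    previous modes. Generic sketch; MSI's importance estimate is not shown."""
    return c * np.sum(omega * (theta - theta_prev) ** 2)

theta_prev = np.array([1.0, -2.0, 0.5])   # parameters after previous modes
omega = np.array([10.0, 0.1, 0.0])        # hypothetical importance weights
theta = np.array([1.2, -1.0, 3.0])        # candidate parameters for new mode
print(si_penalty(theta, theta_prev, omega))
```

Here the small drift in the heavily weighted first parameter dominates the penalty, while the large drift in the zero-weight third parameter costs nothing, which is exactly how consolidation protects mode-relevant parameters.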
- Research Article
2
- 10.1360/cjcp2007.20(2).167.6
- Apr 1, 2007
- Chinese Journal of Chemical Physics
The structure-activity relationship of fluoroquinolones, which show anti-K. pneumoniae activity, was studied by using principal component analysis (PCA) and hierarchical cluster analysis (HCA). The PCA results showed that the lowest unoccupied molecular orbital energy, energy difference between the highest occupied and the lowest unoccupied molecular orbital, dipole moment, net atomic charge on atom I, molecular polarizability, partition coefficient and molecular refractivity of these compounds are responsible for the separation between high-activity and low-activity groups. The HCA results were similar to those obtained with PCA. By using the chemometric results, four synthetic compounds were analyzed through PCA and HCA, and three of them are proposed as active molecules against K. pneumoniae which is consistent with the results of clinical experiments. The methodologies of PCA and HCA provide a reliable rule for classifying new fluoroquinolones with anti-K. pneumoniae activity.
- Research Article
- 10.30906/0023-1134-2007-41-2-23-28
- Jan 1, 2007
Quantitative structure–pharmacokinetic/pharmacodynamic (PK/PD) relationship (QSPR) techniques and chemometric methods were employed to classify fluoroquinolones with respect to their activity against Streptococcus pneumoniae. Density functional theory (DFT) was used to calculate a set of molecular descriptors (properties) for 13 synthetic fluoroquinolones. The descriptors were further analyzed using chemometric methods including principal component analysis (PCA), hierarchical cluster analysis (HCA), and stepwise discriminant analysis (SDA). The PCA and SDA methods were employed in order to reduce the dimensionality and select a subset of variables that would be most effective for classifying the fluoroquinolones according to their degree of antipneumococcal activity. The methods of PCA, SDA and HCA were quite efficient in classifying the 13 compounds into two groups (active and inactive), and the net charge on ring B (QB), molecular volume (VOL), and partition coefficient (log P) were found to be important descriptors for the classification. These methodologies of PCA, SDA and HCA provide a reliable rule for classifying new fluoroquinolones with respect to antipneumococcal activity. The application of such QSPR analysis is of considerable value for clinicians, drug developers, and regulators because PK/PD principles form the basis of modern antimicrobial chemotherapy.
- Research Article
3
- 10.1186/s12859-022-04770-3
- Jun 17, 2022
- BMC Bioinformatics
Background: Pan-omics, pan-cancer analysis has advanced our understanding of the molecular heterogeneity of cancer. However, such analyses have been limited in their ability to use information from multiple sources of data (e.g., omics platforms) and multiple sample sets (e.g., cancer types) to predict clinical outcomes. We address the issue of prediction across multiple high-dimensional sources of data and sample sets by using molecular patterns identified by BIDIFAC+, a method for integrative dimension reduction of bidimensionally-linked matrices, in a Bayesian hierarchical model. Our model performs variable selection through spike-and-slab priors that borrow information across clustered data. We use this model to predict overall patient survival from the Cancer Genome Atlas with data from 29 cancer types and 4 omics sources, and use simulations to characterize the performance of the hierarchical spike-and-slab prior. Results: We found that molecular patterns shared across all or most cancers were largely not predictive of survival. However, our model selected patterns unique to subsets of cancers that differentiate clinical tumor subtypes with markedly different survival outcomes. Some of these subtypes were previously established, such as subtypes of uterine corpus endometrial carcinoma, while others may be novel, such as subtypes within a set of kidney carcinomas. Through simulations, we found that the hierarchical spike-and-slab prior performs best in terms of variable selection accuracy and predictive power when borrowing information is advantageous, but also offers competitive performance when it is not. Conclusions: We address the issue of prediction across multiple sources of data by using results from BIDIFAC+ in a Bayesian hierarchical model for overall patient survival.
By incorporating spike-and-slab priors that borrow information across cancers, we identified molecular patterns that distinguish clinical tumor subtypes within a single cancer and within a group of cancers. We also corroborate the flexibility and performance of using spike-and-slab priors as a Bayesian variable selection approach.
- Research Article
4
- 10.1109/tip.2020.2988139
- Jan 1, 2020
- IEEE Transactions on Image Processing
In photon-limited imaging, pixel intensities are affected by photon count noise. Many applications require an accurate estimate of the covariance of the underlying 2-D clean images. For example, in X-ray free electron laser (XFEL) single molecule imaging, the covariance matrix of 2-D diffraction images is used to reconstruct the 3-D molecular structure. Accurate estimation of the covariance from low-photon-count images must take into account that pixel intensities are Poisson distributed, hence the classical sample covariance estimator is highly biased. Moreover, in single molecule imaging, including in-plane rotated copies of all images can further improve the accuracy of covariance estimation. In this paper we introduce an efficient and accurate algorithm for covariance matrix estimation of 2-D images with count noise, including their uniform planar rotations and possibly reflections. Our procedure, steerable ePCA, combines two recently introduced innovations in a novel way. The first is a methodology for principal component analysis (PCA) of Poisson distributions, and more generally of exponential family distributions, called ePCA. The second is steerable PCA, a fast and accurate procedure for including all planar rotations when performing PCA. The resulting principal components are invariant to rotation and reflection of the input images. We demonstrate the efficiency and accuracy of steerable ePCA in numerical experiments involving simulated XFEL datasets and rotated face images from Yale Face Database B.