Applying stability selection to consistently estimate sparse principal components in high-dimensional molecular data.

Martin Sill, Maral Saadati, Axel Benner

doi:10.1093/bioinformatics/btv197

Open Access

https://doi.org/10.1093/bioinformatics/btv197

Journal: Bioinformatics
Publication Date: Apr 10, 2015
Citations: 24
License type: CC BY 4.0

Affiliation: DKFZ-ZMBH Alliance, Heidelberg University

Abstract

Motivation: Principal component analysis (PCA) is a basic tool often used in bioinformatics for visualization and dimension reduction. However, it is known that PCA may not consistently estimate the true direction of maximal variability in high-dimensional, low sample size settings, which are typical for molecular data. Assuming that the underlying signal is sparse, i.e. that only a fraction of features contribute to a principal component (PC), this estimation consistency can be retained. Most existing sparse PCA methods use L1-penalization, i.e. the lasso, to perform feature selection. However, the lasso is known to lack variable selection consistency in high dimensions, so a subsequent interpretation of the selected features can give misleading results.

Results: We present S4VDPCA, a sparse PCA method that incorporates a subsampling approach, namely stability selection. S4VDPCA can consistently select the truly relevant variables contributing to a sparse PC while also consistently estimating the direction of maximal variability. The performance of S4VDPCA is assessed in a simulation study and compared to other PCA approaches, as well as to a hypothetical oracle PCA that ‘knows’ the truly relevant features in advance and thus finds optimal, unbiased sparse PCs. S4VDPCA is computationally efficient and performs best in simulations regarding parameter estimation consistency and feature selection consistency. Furthermore, S4VDPCA is applied to a publicly available gene expression data set of medulloblastoma brain tumors. Features contributing to the first two estimated sparse PCs represent genes significantly over-represented in pathways typically deregulated between molecular subgroups of medulloblastoma.

Availability and implementation: Software is available at https://github.com/mwsill/s4vdpca.

Contact: m.sill@dkfz.de

Supplementary information: Supplementary data are available at Bioinformatics online.

Highlights

Principal component analysis (PCA) is the most popular method for dimension reduction and visualization and is widely used for the analysis of high-dimensional molecular data.
Even though the lasso does not fulfill the oracle property and cannot achieve model selection consistency in high-dimensional data, it selects the truly relevant variables with high probability (Benner et al., 2010). To utilize this property, we propose to apply stability selection (Meinshausen and Bühlmann, 2010) to the lasso estimator involved in the regularized sparse PCA (RSPCA) algorithm.
The results of the simulation study comparing S4VDPCA to RSPCA using different penalization functions, conventional PCA and oracle PCA are shown in Figures 1 and 2.
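
The idea in the second highlight, wrapping the sparsity-inducing step of a sparse PCA in stability selection, can be illustrated with a small sketch. This is not the S4VDPCA algorithm and not the code from the linked repository: it is a simplified, standard-library-only illustration under stated assumptions. The function names, the simulated data, the choice of keeping the `q` largest absolute loadings on each subsample (which has the same support as one soft-thresholding step at a matching penalty), and the cutoff `pi_thr = 0.75` are all hypothetical choices made for this example.

```python
import math
import random

def leading_loadings(X, iters=50):
    """Leading right singular vector of column-centered X (power iteration)."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]
    v = [1.0 / math.sqrt(p)] * p  # deterministic start; assumed not orthogonal to the PC
    for _ in range(iters):
        scores = [sum(r[j] * v[j] for j in range(p)) for r in Xc]            # X v
        w = [sum(Xc[i][j] * scores[i] for i in range(n)) for j in range(p)]  # X^T (X v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def top_q_support(v, q):
    """Indices of the q largest |loadings|: the support left by soft-thresholding
    at the (q+1)-th largest absolute loading (ignoring ties)."""
    order = sorted(range(len(v)), key=lambda j: abs(v[j]), reverse=True)
    return set(order[:q])

def stability_select(X, q=6, n_subsamples=50, pi_thr=0.75, seed=0):
    """Selection frequency of each feature over random half-subsamples."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_subsamples):
        idx = rng.sample(range(n), n // 2)          # subsample without replacement
        v = leading_loadings([X[i] for i in idx])
        for j in top_q_support(v, q):
            counts[j] += 1
    freq = [c / n_subsamples for c in counts]
    selected = [j for j in range(p) if freq[j] >= pi_thr]
    return selected, freq

def simulate(n=40, p=30, k=5, seed=1):
    """Hypothetical data: one sparse PC carried by the first k features."""
    rng = random.Random(seed)
    v_true = [1.0 / math.sqrt(k)] * k + [0.0] * (p - k)
    X = []
    for _ in range(n):
        s = rng.gauss(0, 5)                          # latent PC score
        X.append([s * v_true[j] + rng.gauss(0, 1) for j in range(p)])
    return X

X = simulate()
selected, freq = stability_select(X)

# Refit the loadings on the stably selected features only; all others stay zero.
sub = leading_loadings([[row[j] for j in selected] for row in X])
sparse_pc = [0.0] * len(X[0])
for j, w in zip(selected, sub):
    sparse_pc[j] = w

print("selected features:", selected)
```

Features that are chosen only occasionally, because they happen to correlate with the signal in a particular subsample, receive low selection frequencies and are dropped, while features that truly carry the component are chosen in almost every subsample. The published method additionally addresses the choice of penalty and error control, which this sketch does not attempt.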

Summary

Introduction

Principal component analysis (PCA) is the most popular method for dimension reduction and visualization and is widely used for the analysis of high-dimensional molecular data. PCA aims to project a high-dimensional data matrix into a lower-dimensional space by seeking linear combinations of the original variables, called principal components (PCs). By construction, these PCs capture maximal variance and are mutually orthogonal.
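
The two defining properties mentioned above, maximal variance and orthogonality, can be checked numerically. The following is a minimal sketch using only the Python standard library, assuming a small simulated data set with hypothetical column standard deviations; it extracts the first two PCs by power iteration with deflation, which is one textbook way to compute them and is not taken from the paper.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def covariance(X):
    """Sample covariance matrix of the columns of X."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]
    return [[sum(r[i] * r[j] for r in Xc) / (n - 1) for j in range(p)]
            for i in range(p)]

def leading_eigenpair(S, rng, iters=300):
    """Largest eigenvalue and unit eigenvector of a symmetric PSD matrix."""
    v = [rng.gauss(0, 1) for _ in range(len(S))]
    for _ in range(iters):
        w = [dot(row, v) for row in S]
        norm = math.sqrt(dot(w, w))
        v = [x / norm for x in w]
    eigval = dot(v, [dot(row, v) for row in S])  # Rayleigh quotient
    return eigval, v

def pca(X, n_components=2, seed=0):
    """Return (variance, loading vector) pairs for the leading PCs."""
    rng = random.Random(seed)
    S = covariance(X)
    p = len(S)
    components = []
    for _ in range(n_components):
        lam, v = leading_eigenpair(S, rng)
        components.append((lam, v))
        # Deflation: remove the variance explained by this PC before the next one.
        S = [[S[i][j] - lam * v[i] * v[j] for j in range(p)] for i in range(p)]
    return components

# Hypothetical data: 200 samples, 5 independent features with unequal spread.
rng = random.Random(42)
sds = [3.0, 2.0, 1.0, 1.0, 1.0]
X = [[rng.gauss(0, sd) for sd in sds] for _ in range(200)]

(lam1, v1), (lam2, v2) = pca(X)
print("variance of PC1:", lam1)
print("variance of PC2:", lam2)
print("inner product of loadings:", dot(v1, v2))
```

The first PC captures at least as much variance as the second, and the inner product of the two loading vectors is zero up to floating-point error.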
