Super-sparse principal component analyses for high-throughput genomic data

Donghwan Lee, Yudi Pawitan, Woojoo Lee, Youngjo Lee

doi:10.1186/1471-2105-11-296


Open Access

https://doi.org/10.1186/1471-2105-11-296


Abstract

Background: Principal component analysis (PCA) has gained popularity as a method for the analysis of high-dimensional genomic data. However, it is often difficult to interpret the results because the principal components are linear combinations of all variables, and the coefficients (loadings) are typically nonzero. These nonzero values also reflect poor estimation of the true loading vectors; for example, for gene expression data, biologically we expect only a portion of the genes to be expressed in any tissue, and an even smaller fraction to be involved in a particular process. Sparse PCA methods have recently been introduced for reducing the number of nonzero coefficients, but these existing methods are not satisfactory for high-dimensional data applications because they still give too many nonzero coefficients.

Results: Here we propose a new PCA method that uses two innovations to produce an extremely sparse loading vector: (i) a random-effect model on the loadings that leads to an unbounded penalty at the origin and (ii) shrinkage of the singular values obtained from the singular value decomposition of the data matrix. We develop a stable computing algorithm by modifying the nonlinear iterative partial least squares (NIPALS) algorithm, and illustrate the method with an analysis of the NCI cancer dataset that contains 21,225 genes.

Conclusions: The new method has better performance than several existing methods, particularly in the estimation of the loading vectors.
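To make the NIPALS iteration mentioned in the abstract concrete, here is a minimal sketch in Python with NumPy. It is not the authors' modified algorithm: the abstract does not give their random-effect penalty or their singular-value shrinkage, so this sketch swaps in plain soft-thresholding (a lasso-type penalty of the kind used by earlier sparse PCA methods) purely as a stand-in. The function name `nipals_pc1`, the parameter `lam`, and the simulated data are all assumptions made for illustration.

```python
import numpy as np


def nipals_pc1(X, lam=0.0, n_iter=500, tol=1e-10):
    """Leading component of a column-centered X by NIPALS-style iteration.

    lam = 0 gives the ordinary first principal component. lam > 0
    soft-thresholds the loadings at every step; this is an illustrative
    lasso-type stand-in, NOT the penalty proposed in the paper.
    """
    X = np.asarray(X, dtype=float)
    # Initialize the score vector with the highest-variance column.
    t = X[:, np.argmax(X.var(axis=0))].copy()
    v = np.zeros(X.shape[1])
    for _ in range(n_iter):
        # Regress every column of X on the current scores.
        z = X.T @ t / (t @ t)
        # Soft-threshold the loadings (no-op when lam == 0).
        z = np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)
        norm = np.linalg.norm(z)
        if norm == 0.0:
            raise ValueError("lam is too large: all loadings shrunk to zero")
        v_new = z / norm          # unit-length loading vector
        t = X @ v_new             # updated scores
        converged = np.linalg.norm(v_new - v) < tol
        v = v_new
        if converged:
            break
    return t, v


# Simulated data (hypothetical): 50 samples, 200 "genes", of which only
# the first 10 carry the signal of a single underlying component.
rng = np.random.default_rng(1)
n, p, k = 50, 200, 10
v_true = np.zeros(p)
v_true[:k] = 1.0 / np.sqrt(k)
X = 5.0 * np.outer(rng.normal(size=n), v_true) + 0.5 * rng.normal(size=(n, p))
X -= X.mean(axis=0)

_, v_dense = nipals_pc1(X)             # ordinary PCA loadings
_, v_sparse = nipals_pc1(X, lam=0.1)   # thresholded loadings

print("nonzero loadings, dense :", np.count_nonzero(v_dense))
print("nonzero loadings, sparse:", np.count_nonzero(v_sparse))
```

With `lam=0` the iteration is a power method on the data matrix, so every loading is generally nonzero, which is the interpretability problem the abstract describes. The thresholded run shows how a penalty inside the same loop can zero out loadings of variables that carry no signal.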

Highlights

Principal component analysis (PCA) has gained popularity as a method for the analysis of high-dimensional genomic data
We provide some simulation studies that indicate that these sparse PCA (SPCA) methods perform better than existing ones, and illustrate their use with a cancer gene-expression dataset with 21,225 genes
We first perform small simulation studies in order to assess the performance of the proposed sparse PCA methods and compare them against other methods

Summary

Introduction

Principal component analysis (PCA), or its equivalent, the singular-value decomposition (SVD), is widely used for the analysis of high-dimensional data and has gained popularity as a method for the analysis of high-dimensional genomic data. For such gene expression data with an enormous number of variables, PCA is a useful technique for visualization, analysis and interpretation [1,2,3,4]. However, it is often difficult to interpret the results because the principal components are linear combinations of all variables, and the coefficients (loadings) are typically nonzero. These nonzero values reflect poor estimation of the true loading vectors; for example, for gene expression data, biologically we expect only a portion of the genes to be expressed in any tissue, and an even smaller fraction to be involved in a particular process. In this paper, our focus is on PCA methodology constrained to produce sparse loadings.
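The equivalence between PCA and the SVD noted above can be checked numerically. The sketch below uses NumPy on simulated data; the matrix sizes and variable names are assumptions for illustration and are not taken from the paper. It also shows why interpretation is hard: with no sparsity constraint, every variable receives a nonzero loading.

```python
import numpy as np

# Hypothetical data: 20 samples (rows) by 100 "genes" (columns).
rng = np.random.default_rng(0)
n_samples, n_genes = 20, 100
X = rng.normal(size=(n_samples, n_genes))
Xc = X - X.mean(axis=0)          # center each column

# PCA via the SVD of the centered data matrix.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
loadings = Vt[0]                 # first loading vector (unit length)
scores = U[:, 0] * s[0]          # first principal component scores

# PCA via the eigendecomposition of the sample covariance matrix.
cov = Xc.T @ Xc / (n_samples - 1)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
top_eigvec = eigvecs[:, -1]

# The two routes agree up to the sign of the vector.
print("same direction:", np.isclose(abs(loadings @ top_eigvec), 1.0))
print("same variance :", np.isclose(eigvals[-1], s[0] ** 2 / (n_samples - 1)))

# Without a sparsity constraint, all loadings are nonzero.
print("nonzero loadings:", np.count_nonzero(loadings), "of", n_genes)
```

Working from the SVD avoids forming the genes-by-genes covariance matrix, which matters when the number of variables is in the tens of thousands, as in the 21,225-gene dataset mentioned above.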



Journal: BMC Bioinformatics	Publication Date: Jun 2, 2010
Citations: 64	License type: cc-by


