Abstract

Background

Although principal component analysis (PCA) is widely used for the dimensional reduction of biomedical data, interpretation of PCA results remains daunting. Most existing interpretation methods attempt to explain each principal component (PC) in terms of a small number of variables by generating approximate PCs with mainly zero loadings. Although useful when just a few variables dominate the population PCs, these methods can perform poorly on genomic data, where interesting biological features are frequently represented by the combined signal of functionally related sets of genes. While gene set testing methods have been widely used in supervised settings to quantify the association of groups of genes with clinical outcomes, these methods have seen only limited application for testing the enrichment of gene sets relative to sample PCs.

Results

We describe a novel approach, principal component gene set enrichment (PCGSE), for unsupervised gene set testing relative to the sample PCs of genomic data. The PCGSE method computes the statistical association between gene sets and individual PCs using a two-stage competitive gene set test. To demonstrate the efficacy of the PCGSE method, we use simulated and real gene expression data to evaluate the performance of various gene set test statistics and significance tests.

Conclusions

Gene set testing is an effective approach for interpreting the PCs of high-dimensional genomic data. As shown using both simulated and real datasets, the PCGSE method can generate biologically meaningful and computationally efficient results via a two-stage, competitive parametric test that correctly accounts for inter-gene correlation.

Electronic supplementary material

The online version of this article (doi:10.1186/s13040-015-0059-z) contains supplementary material, which is available to authorized users.

Highlights

  • Although principal component analysis (PCA) is widely used for the dimensional reduction of biomedical data, interpretation of PCA results remains daunting.

  • We have developed principal component gene set enrichment (PCGSE), an approach for interpreting the PCs of genomic data via two-stage competitive gene set testing. The correlation between each gene and each PC serves as the gene-level statistic, and both the gene set test statistic and the method used to compute its null distribution can be chosen flexibly.

  • Evaluation using Spellman et al. α factor-synchronized yeast gene expression data and yeast cell cycle gene sets: The PCGSE method was used to compute the statistical association of the yeast cell cycle gene sets defined by Spellman et al. [39] relative to the first three PCs of a specially processed version of the α factor-synchronized yeast gene expression data collected by Spellman et al. and re-examined by Alter et al. [5]. Both the α factor-synchronized data and the yeast cell cycle gene sets were downloaded from the Additional file 1 website for Alter et al. To support comparison against the results reported in Alter et al., PCA was performed on a version of the gene expression data processed according to the steps outlined in Alter et al., so that the first three PCs were identical to the first three so-called eigengenes.
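The two-stage idea described in the highlights can be illustrated with a minimal sketch. It is not the authors' implementation. Several choices here are assumptions made only for illustration: the function names (`pc_scores`, `gene_pc_correlations`, `competitive_test`) are hypothetical, the gene set statistic is the mean difference in absolute correlation between set and non-set genes, and the null distribution comes from permuting gene labels. Gene permutation treats genes as independent, so it does not reproduce the correlation-adjusted parametric test that the abstract favors.

```python
import numpy as np


def pc_scores(X, k):
    """Scores of the first k sample PCs of X (rows = samples, columns = genes)."""
    Xc = X - X.mean(axis=0)                      # center each gene
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :k] * s[:k]                      # samples x k


def gene_pc_correlations(X, scores):
    """Stage 1: Pearson correlation between every gene and every PC.

    Assumes no gene has zero variance (that would divide by zero).
    """
    Xc = X - X.mean(axis=0)
    Sc = scores - scores.mean(axis=0)
    num = Xc.T @ Sc                              # genes x PCs
    denom = np.outer(np.linalg.norm(Xc, axis=0), np.linalg.norm(Sc, axis=0))
    return num / denom


def competitive_test(gene_stats, in_set, n_perm=2000, seed=0):
    """Stage 2: competitive test of one gene set against one PC.

    Statistic (an illustrative choice): mean |gene statistic| inside the set
    minus the mean outside it. The null is built by drawing random gene sets
    of the same size, which ignores inter-gene correlation.
    """
    rng = np.random.default_rng(seed)
    z = np.abs(gene_stats)
    m = int(in_set.sum())

    def mean_diff(mask):
        return z[mask].mean() - z[~mask].mean()

    observed = mean_diff(in_set)
    null = np.empty(n_perm)
    for i in range(n_perm):
        mask = np.zeros(z.size, dtype=bool)
        mask[rng.choice(z.size, size=m, replace=False)] = True
        null[i] = mean_diff(mask)

    # One-sided p-value with the usual +1 correction for permutation tests.
    p_value = (1 + np.sum(null >= observed)) / (n_perm + 1)
    return observed, p_value
```

To use the sketch, pass a samples-by-genes matrix to `pc_scores`, feed the result to `gene_pc_correlations`, and then call `competitive_test` on one column of the correlation matrix together with a boolean mask marking the members of the gene set.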


