Abstract

Detecting associations between an input gene set and annotated gene sets (e.g., pathways) is an important problem in modern molecular biology. In this paper, we propose two algorithms, termed NetPEA and NetPEA’, for conducting network-based pathway enrichment analysis. Our algorithms consider not only shared genes but also gene–gene interactions. Both algorithms utilize a protein–protein interaction network and a random walk with a restart procedure to identify hidden relationships between an input gene set and pathways, but both use different randomization strategies to evaluate statistical significance and as a result emphasize different pathway properties. Compared to an over representation-based method, our algorithms can identify more statistically significant pathways. Compared to an existing network-based algorithm, EnrichNet, our algorithms have a higher sensitivity in revealing the true causal pathways while at the same time achieving a higher specificity. A literature review of selected results indicates that some of the novel pathways reported by our algorithms are biologically relevant and important. While the evaluations are performed only with KEGG pathways, we believe the algorithms can be valuable for general functional discovery from high-throughput experiments.

Highlights

  • Modern molecular biology has been revolutionized with the emergence of high-throughput experimental technologies such as microarrays and next-generation DNA sequencing

  • We propose two novel network-based algorithms to analyze functional associations between input gene sets and annotated gene sets (e.g., Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways)

  • The two algorithms apply different randomization strategies to evaluate the statistical significance of the associations and often return complementary results

Read more

Summary

Introduction

Modern molecular biology has been revolutionized with the emergence of high-throughput experimental technologies such as microarrays and next-generation DNA sequencing. A typical output from such a high-throughput experiment is a list of genes that are observed to be associated with a certain phenotype, such as those differentially expressed in tumors compared to normal tissues. A principled way to interpret such gene lists is to compare them with a database of well-annotated gene sets, such as biological pathways. One of the most widely used approach, Over Representation Analysis (ORA) [1], counts the number of common genes shared by an input gene set and each annotated gene set and applies a statistical test, such as the cumulative hyper-geometric test, to calculate the statistical significance of the overlap. A p-value cutoff, e.g., 0.05, is applied to select the annotated gene sets that share a statistically significant overlap

Methods
Results
Conclusion
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.