Abstract

Next-generation sequencing data pose a severe curse of dimensionality, complicating traditional "single-marker, single-trait" analysis. We propose a two-stage combined p-value method for pathway analysis. The first stage is at the gene level, where we integrate effects within a gene using the Sequence Kernel Association Test (SKAT). The second stage is at the pathway level, where we perform a correlated Lancaster procedure to detect joint effects from multiple genes within a pathway. We show that the Lancaster procedure is optimal in Bahadur efficiency among all combined p-value methods. The Bahadur efficiency compares the sample sizes required by different statistical tests as signals become sparse in sequencing data, i.e., ε → 0. The optimal Bahadur efficiency ensures that the Lancaster procedure asymptotically requires a minimal sample size to detect sparse signals. The Lancaster procedure can also be applied to meta-analysis. Extensive empirical assessments of exome sequencing data show that the proposed method outperforms Gene Set Enrichment Analysis (GSEA). We applied the competitive Lancaster procedure to meta-analysis data generated by the Global Lipids Genetics Consortium to identify pathways significantly associated with high-density lipoprotein cholesterol, low-density lipoprotein cholesterol, triglycerides, and total cholesterol.
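
The sample-size comparison behind Bahadur efficiency can be written out as follows. This is a sketch of the standard textbook definition, in notation assumed here rather than taken from the paper: p_n^{(i)} is the p-value of test i on n observations under a fixed alternative θ, c_i(θ) is its exact slope, and N_i(ε) is the smallest sample size at which test i reaches significance level ε.

```latex
-\frac{2}{n}\,\log p_n^{(i)} \;\xrightarrow{\text{a.s.}}\; c_i(\theta),
\qquad
\frac{c_1(\theta)}{c_2(\theta)} \;=\; \lim_{\varepsilon \to 0} \frac{N_2(\varepsilon)}{N_1(\varepsilon)}
```

Under this definition, a test with a larger exact slope needs asymptotically fewer observations to reach the same small significance level.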

Highlights

  • Next-generation sequencing (NGS) technology has opened a new era for studying genetic associations with complex diseases

  • We applied the competitive Lancaster procedure to meta-analysis data generated by the Global Lipids Genetics Consortium to identify pathways significantly associated with high-density lipoprotein cholesterol, low-density lipoprotein cholesterol, triglycerides, and total cholesterol

  • For large-scale tests, which often occur in next-generation sequencing data, the Lancaster procedure requires smaller sample sizes than Good’s test, i.e., N_Lancaster < N_Good, when the significance level goes to 0, which represents sparse signals in high-throughput data

Summary

Introduction

Next-generation sequencing (NGS) technology has opened a new era for studying genetic associations with complex diseases. Sequencing data often contain millions of genetic variants. To maintain statistical power for detecting rare variants, a theoretical sample size of n > 10,000 may be required for sequencing data [1]. These dimensional challenges motivate us to aggregate effects from multiple genes using pathway analysis. For non-Mendelian diseases and complex traits, multiple genetic risk factors may function together in a pathway. Signals may not be significant in the "single-marker, single-trait" analysis, but many such values from related genes might provide valuable information regarding gene function and regulation. We propose a two-stage combined p-value method for pathway (gene set) analysis of NGS data. We applied the competitive Lancaster procedure to meta-analysis data generated by the Global Lipids Genetics Consortium.
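
To make the second stage concrete, the following is a minimal sketch of a Lancaster-style combination of gene-level p-values. It rests on assumptions that are not from the paper: it treats the gene-level p-values as independent (the paper's procedure accounts for correlation, which this sketch does not), it restricts the weights to positive even integers so that the chi-square tail has a closed form, and the function names and example values are hypothetical.

```python
import math


def chi2_sf_even(x, df):
    """Upper-tail probability of a chi-square variable with even df = 2k.

    Uses the closed form exp(-x/2) * sum_{j<k} (x/2)^j / j!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer in this sketch")
    if x <= 0:
        return 1.0
    half = x / 2.0
    term = 1.0
    total = 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total


def chi2_isf_even(p, df, tol=1e-12):
    """Quantile x with chi2_sf_even(x, df) == p, found by bisection."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    if p == 1.0:
        return 0.0
    lo, hi = 0.0, 1.0
    # Expand the bracket until the tail probability drops to p or below.
    while chi2_sf_even(hi, df) > p:
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_sf_even(mid, df) > p:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return (lo + hi) / 2.0


def lancaster_pvalue(pvalues, weights):
    """Combine p-values assuming independence (an assumption of this sketch).

    Each p-value is mapped to a chi-square quantile with df equal to its
    weight; the sum is referred to a chi-square with df = sum(weights).
    """
    if len(pvalues) != len(weights):
        raise ValueError("pvalues and weights must have the same length")
    stat = sum(chi2_isf_even(p, w) for p, w in zip(pvalues, weights))
    return chi2_sf_even(stat, sum(weights))


# Hypothetical gene-level p-values (e.g. from SKAT) and illustrative weights.
gene_pvalues = [0.01, 0.20, 0.03]
gene_weights = [4, 2, 6]
pathway_p = lancaster_pvalue(gene_pvalues, gene_weights)
```

When every weight equals 2, each quantile is -2 ln p_i, so the statistic reduces to Fisher's combined test; unequal weights let some genes contribute more to the pathway-level statistic than others.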
