Abstract
Gene-based tests of association (e.g., variance-component and burden tests) are now common practice in analyses attempting to elucidate the contribution of rare genetic variants to common disease. As sequencing datasets continue to grow in size, the number of variants within each set (e.g., gene) being tested also continues to grow. Pathway-based methods have been used to first aggregate gene-based statistical evidence and then aggregate that evidence across the pathway. This “multi-set” approach (a gene-based test first, followed by a pathway-based test) has not been thoroughly explored for evaluating genotype–phenotype associations in the age of large, sequenced datasets. In particular, we ask whether there are statistical and biological characteristics that make the multi-set approach preferable to simply performing all gene-based tests. In this paper, we provide an intuitive framework for evaluating these questions and use simulated data to affirm this intuition. A real-data application demonstrates how our insights manifest in practice. Ultimately, we find that when the initial subsets are biologically informative (e.g., tending to aggregate causal genetic variants within one or more subsets, often genes), multi-set strategies can improve statistical power, with particular gains when causal variants are concentrated in subsets containing fewer variants overall (i.e., a high proportion of causal variants in the subset). However, we find little advantage when the sets are non-informative (a similar proportion of causal variants across subsets). Our application to real data further demonstrates this intuition. In practice, we recommend wider use of pathway-based methods and further exploration of optimal ways of aggregating variants into subsets based on emerging biological evidence about the genetic architecture of complex disease.
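To make the multi-set idea concrete, the following is a minimal sketch of a two-stage analysis: a toy burden-style gene-level test per gene, followed by pathway-level aggregation of the gene-level p-values via Fisher's method. The function names, the simulated data, the burden-style gene test, and the choice of Fisher's method are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a "multi-set" analysis: a toy gene-level test per gene,
# then pathway-level aggregation of the gene-level p-values (Fisher's method).
# Function names and the simulated data are illustrative, not from the paper.
import numpy as np
from scipy import stats

def gene_level_pvalue(genotypes, phenotype):
    """Toy burden-style gene-level test: regress the phenotype on the
    per-sample minor-allele count summed across the gene's variants."""
    burden = genotypes.sum(axis=1)
    result = stats.linregress(burden, phenotype)
    return result.pvalue

def pathway_pvalue(gene_genotypes, phenotype):
    """Combine gene-level p-values across the pathway (Fisher's method)."""
    pvals = [gene_level_pvalue(g, phenotype) for g in gene_genotypes]
    _, p_combined = stats.combine_pvalues(pvals, method="fisher")
    return p_combined

# Three genes with 20, 5, and 40 rare variants in 500 samples; the causal
# variants are concentrated in the smallest gene.
rng = np.random.default_rng(0)
genes = [rng.binomial(2, 0.01, size=(500, m)).astype(float) for m in (20, 5, 40)]
phenotype = rng.normal(size=500) + 0.5 * genes[1].sum(axis=1)
print(pathway_pvalue(genes, phenotype))
```

Whether the two-stage pathway p-value outperforms testing each gene separately depends on how informative the gene groupings are, which is the question examined here.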
Highlights
With continued dramatic growth in the amount of sequencing data available, there is a persistent interest in exploring the role that common and rare genetic variants may have in explaining the etiology of complex phenotypes
There are two broad classes of such tests: burden tests [e.g., CMC (Li and Leal, 2008)] and variance-component tests [e.g., Sequence Kernel Association Test (SKAT) (Wu et al., 2011)], with the main distinction being whether or not the test accounts for the potential beneficial impact of rare variants on disease, which is the case for variance-component tests, but not burden tests
In addition to the generic variance-component test, we explored the behavior of a generic burden test and observed behavior similar to what we report in Section 3
Summary
With continued dramatic growth in the amount of sequencing data available, there is a persistent interest in exploring the role that common and rare genetic variants may have in explaining the etiology of complex phenotypes. Numerous methods of summarizing the relationships between sets of rare (and/or common) genetic variants have been proposed. There are two broad classes of such tests: burden tests [e.g., CMC (Li and Leal, 2008)] and variance-component tests [e.g., SKAT (Wu et al., 2011)], with the main distinction being whether or not the test accounts for the potential beneficial (protective) impact of rare variants on disease, which is the case for variance-component tests, but not burden tests. Burden tests may be more powerful when risk-impacting variants have generally similar effects, whereas variance-component tests may be more powerful when there is heterogeneity in the variant effects (Basu and Pan, 2011; Liu et al., 2013). A newer class of approaches attempts to optimally combine these two broad classes (Lee et al., 2012; Greco et al., 2016), though simulation evidence suggests that no single method is universally most powerful (Ladouceur et al., 2012; Derkach et al., 2014).
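To illustrate the structural difference between the two classes, the sketch below contrasts a burden-style score statistic, which squares a single weighted sum of per-variant scores (so opposing effects can cancel), with a variance-component (SKAT-style) statistic, which sums squared per-variant scores (so opposing effects still contribute). The equal weights, intercept-only null model, and simulated data are illustrative assumptions, and the p-value machinery (a mixture of chi-squares for the variance-component statistic) is omitted; this is not SKAT's actual implementation.

```python
# Illustrative contrast between a burden-style score statistic and a
# SKAT-style variance-component statistic (simplified null model, toy weights).
import numpy as np

def burden_statistic(G, y, w):
    """Burden-style: collapse variants into one weighted score per sample,
    then square the resulting single score statistic."""
    resid = y - y.mean()          # residuals under an intercept-only null model
    s = G @ w                     # weighted burden per sample
    return (s @ resid) ** 2 / (s @ s)

def variance_component_statistic(G, y, w):
    """SKAT-style: sum the squared per-variant score contributions, so
    variants with opposite effect directions do not cancel."""
    resid = y - y.mean()
    per_variant = (G * w).T @ resid   # one score per variant
    return per_variant @ per_variant

# Simulated gene with 10 rare variants, half risk-increasing and half protective.
rng = np.random.default_rng(1)
G = rng.binomial(2, 0.02, size=(1000, 10)).astype(float)
beta = np.array([0.8, -0.8] * 5)
y = G @ beta + rng.normal(size=1000)
w = np.ones(10)
print(burden_statistic(G, y, w), variance_component_statistic(G, y, w))
```

Because the two statistics follow different null distributions, the raw values are not directly comparable; the sketch only highlights why mixed-direction (deleterious and protective) effects penalize burden-style collapsing but not the variance-component form.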