Next-generation sequencing and other high-throughput technologies have made it feasible to characterize millions of sequence variations in large numbers of study participants. But when it comes to identifying a small number of these genetic features (or feature sets) that are associated with a disease trait, the investigator faces a formidable multiple-testing challenge. It can be thought of as a signal-to-noise problem, in which the large number of unrelated genetic features tends to drown out the faint signal of the small number of biologically relevant features. The theoretical underpinnings of an emerging class of statistical methods for genomic studies, two-stage procedures for both gene-gene and gene-environment interactions, have recently been described in a remarkable article (Dai et al., 2012). The key idea is that the dimensionality of multiple testing in genomics can be reduced by screening the features to be tested with an independent statistic in the same dataset, thereby mitigating the multiple-testing problem and increasing power to detect effects. In other words, the noise is reduced, allowing the relevant signal to be more easily detected. These methods will likely gain importance as high-throughput technologies continue to yield exponentially increasing amounts of information per sample and per research dollar spent. Dai et al. couched their paper in the context of gene-environment interactions only. However, it is worth noting that the theoretical properties detailed by Dai et al. apply not just to the search for gene-environment interactions (GxE) but also to (epistatic) interactions between genetic variants (GxG), since in constructing these hypothesis tests, both "gene" and "environment" features are treated analogously, as discrete or continuous variables in models designed to identify associations with a disease trait. A notable exception is when the approach depends on the environmental exposure being a randomized treatment, allowing additional assumptions to be made.
One such screening-testing interaction approach is designed for a case-control study in which the investigator is interested in identifying GxG or GxE pairs involved in interactions (Millstein et al., 2006; Murcray et al., 2009; Dai et al., 2012; Lewinger et al., 2013). Each pair of features considered is assumed to be independent in the general population, and only if a dependence is found in the pooled case-control sample (the screening stage) is the pair tested in a formal model that includes an interaction term (the testing stage), e.g., logit(P[D]) = α + β1*SNP1 + β2*SNP2 + β3*SNP1*SNP2, where β3 is the interaction parameter and D indicates disease. The interaction parameter can be tested alone or in a multi-degree-of-freedom test of one or both main effects together with the interaction, an approach that was generally found to be more powerful (Millstein et al., 2006; Kraft et al., 2007). An important characteristic of the approach is that even if the independence assumption is not justified, type I error in the testing stage is still properly controlled. This approach is perhaps more general and more powerful than previously appreciated. The screening procedure appears to be sensitive to both main effects and interactions, not just to interactions, as was claimed in prior work. The implication is that the approach is less specific to interactions and correspondingly more powerful when main effects are present. In fact, it may be capable of detecting weak interactions coupled with weak main effects.
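To make the two-stage logic concrete, the following is a minimal sketch in Python (using numpy, scipy, and statsmodels), assuming a simulated toy case-control dataset; the simulated data, the significance thresholds, and the use of a Pearson correlation as the screening statistic are illustrative assumptions, not the authors' implementation.

```python
# Sketch of a screen-then-test procedure for GxG pairs in a case-control sample.
import numpy as np
import scipy.stats as st
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, n_snps = 2000, 50                                           # 1000 cases + 1000 controls, 50 SNPs (toy scale)
geno = rng.binomial(2, 0.3, size=(n, n_snps)).astype(float)    # additive genotype codes 0/1/2
status = np.repeat([1, 0], n // 2)                             # 1 = case, 0 = control (placeholder phenotype)

pairs = [(i, j) for i in range(n_snps) for j in range(i + 1, n_snps)]

# Stage 1 (screening): test SNP-SNP dependence in the *pooled* case-control sample.
# A pair is carried forward only if this screening statistic is significant.
alpha_screen = 0.05
screened = []
for i, j in pairs:
    r, p = st.pearsonr(geno[:, i], geno[:, j])                 # simple dependence statistic
    if p < alpha_screen:
        screened.append((i, j))

# Stage 2 (testing): fit logit(P[D]) = a + b1*SNP1 + b2*SNP2 + b3*SNP1*SNP2 and test b3,
# correcting only for the number of pairs that survived the screen.
alpha_test = 0.05 / max(len(screened), 1)
for i, j in screened:
    X = sm.add_constant(np.column_stack([geno[:, i], geno[:, j], geno[:, i] * geno[:, j]]))
    fit = sm.Logit(status, X).fit(disp=0)
    if fit.pvalues[3] < alpha_test:
        print(f"SNP{i} x SNP{j}: interaction p = {fit.pvalues[3]:.3g}")
```

In this sketch the stage-2 multiplicity correction is applied only to the pairs that survive the screen, which is the source of the power gain described above; the stage-2 test could equally be the multi-degree-of-freedom test of main effects plus interaction mentioned in the text.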
Some authors (Murcray et al., 2009; Dai et al., 2012; Lewinger et al., 2013) have attributed the statistical power of the screening procedure solely to an association in cases due to an interaction in the underlying population (a non-zero β3 or, more precisely, a departure from multiplicativity on a relative risk scale), as in the case-only interaction analysis (Piegorsch et al., 1994). According to this view, controls only contribute noise to the screening procedure because the factors are independent in this population. Further, if the two features contribute marginal disease risks and a multiplicative relative risk model describes their joint risk, then dependencies will not be induced among cases. The idea is that if there is independence in cases and independence in controls, then independence should also hold in the pooled case-control sample; however, this is not necessarily the case. It has not been adequately appreciated that when cases and controls are pooled, main effects can contribute a substantial increase in power to capture disease-related feature pairs with the above screening procedure. Interestingly, the complex conditioning on disease status inherent in the pooling of cases and controls can induce dependencies and thus increase the power of the screening procedure when main effects are present. As proof of concept, consider the relatively simple relative risk model log(P[D]) = λ + β1*SNP1 + β2*SNP2 + β3*SNP1*SNP2, where exp(λ) is the baseline risk, the two SNPs have equal relative risks per allele, i.e., β1 = β2, there is a weak interaction (small β3), and equal
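The pooling argument can be illustrated with a small simulation, sketched below under assumed parameter values (baseline risk 1%, relative risk 2 per allele, minor allele frequency 0.3, no interaction at all, i.e., β3 = 0); these values and the simulation itself are illustrative assumptions, not the authors' computation. The two SNPs are independent in the simulated population and remain essentially independent among cases under the multiplicative model, yet a dependence emerges once equal numbers of cases and controls are pooled.

```python
# Simulation: main effects alone can induce SNP-SNP dependence in a pooled case-control sample.
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(1)
N = 500_000
maf = 0.3
snp1 = rng.binomial(2, maf, N).astype(float)
snp2 = rng.binomial(2, maf, N).astype(float)       # independent of snp1 in the population

# log(P[D]) = lam + b1*SNP1 + b2*SNP2 + b3*SNP1*SNP2, with main effects only (b3 = 0).
lam, b1, b2, b3 = np.log(0.01), np.log(2.0), np.log(2.0), 0.0
p_disease = np.exp(lam + b1 * snp1 + b2 * snp2 + b3 * snp1 * snp2)
disease = rng.random(N) < p_disease

# Sample equal numbers of cases and controls and pool them.
n_per_arm = 5000
cases = rng.choice(np.flatnonzero(disease), n_per_arm, replace=False)
controls = rng.choice(np.flatnonzero(~disease), n_per_arm, replace=False)
pooled = np.concatenate([cases, controls])

for label, idx in [("population", np.arange(N)), ("cases only", cases), ("pooled sample", pooled)]:
    r, p = st.pearsonr(snp1[idx], snp2[idx])
    print(f"{label:14s}  corr(SNP1, SNP2) = {r:+.4f}  (p = {p:.3g})")
```

Because both SNP means are elevated among cases, mixing cases with controls creates between-group covariance, so the pooled-sample correlation is nonzero even though cases and controls are each (approximately) internally independent.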