Abstract

Multiple hypothesis testing is an essential component of modern data science. In many settings, in addition to the p-value, additional covariates for each hypothesis are available, e.g., functional annotation of variants in genome-wide association studies. Such information is ignored by popular multiple testing approaches such as the Benjamini-Hochberg procedure (BH). Here we introduce AdaFDR, a fast and flexible method that adaptively learns the optimal p-value threshold from covariates to significantly improve detection power. On eQTL analysis of the GTEx data, AdaFDR discovers 32% more associations than BH at the same false discovery rate. We prove that AdaFDR controls false discovery proportion and show that it makes substantially more discoveries while controlling false discovery rate (FDR) in extensive experiments. AdaFDR is computationally efficient and allows multi-dimensional covariates with both numeric and categorical values, making it broadly useful across many applications.

Highlights

  • Multiple hypothesis testing is an essential component of modern data science

  • The standard assumption of AdaFDR and other related methods is that the covariates do not affect the p-values under the null hypothesis; we prove that AdaFDR controls the false discovery proportion (FDP) under this assumption

  • Our theory proves that AdaFDR controls FDP when the null hypotheses are independent

Introduction

In many settings, in addition to the p-value, covariates for each hypothesis are available, e.g., functional annotation of variants in genome-wide association studies. Such information is ignored by popular multiple testing approaches such as the Benjamini-Hochberg procedure (BH). In expression quantitative trait loci (eQTL) mapping or genome-wide association studies (GWAS), single nucleotide polymorphisms (SNPs) in active chromatin states are more likely to be significantly associated with the phenotype[7]. Such chromatin information is readily available in public databases[8], but it is not used by standard multiple hypothesis testing procedures; at most it informs post hoc biological interpretation. We present AdaFDR, a fast and flexible method that adaptively learns the decision threshold from covariates to significantly improve detection power while controlling the false discovery proportion (FDP) at a user-specified level. AdaFDR learns such enrichment patterns from both the p-values and the covariates, resulting in a covariate-dependent threshold that makes more discoveries under the same FDP constraint (Fig. 1, bottom right).
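For reference, the baseline BH procedure that AdaFDR is compared against can be stated in a few lines. The following is a minimal NumPy sketch of the standard BH step-up rule (not the AdaFDR algorithm itself, which additionally learns a threshold from covariates); the function name and signature are illustrative.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a boolean mask of rejected hypotheses at target FDR level alpha.
    Unlike a covariate-aware method such as AdaFDR, every hypothesis is held
    to the same p-value threshold, regardless of side information.
    """
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)          # indices that sort p ascending
    sorted_p = p[order]
    # BH compares the k-th smallest p-value to alpha * k / m ...
    thresh = alpha * np.arange(1, m + 1) / m
    below = sorted_p <= thresh
    reject = np.zeros(m, dtype=bool)
    if below.any():
        # ... and rejects ALL hypotheses up to the largest k passing the bound
        # (step-up: earlier failures are overridden by a later success).
        k = np.max(np.nonzero(below)[0])
        reject[order[: k + 1]] = True
    return reject
```

A covariate-adaptive method can be viewed as replacing the single level `alpha * k / m` with a threshold that varies per hypothesis according to its covariates, which is what allows more discoveries at the same FDP.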

