Abstract

Investigation of the genetic architecture of gene expression traits has aided interpretation of disease and trait-associated genetic variants; however, key aspects of expression quantitative trait loci (eQTL) study design and analysis remain understudied. We used extensive, empirically driven simulations to explore eQTL study design and the performance of various analysis strategies. Across multiple testing correction methods, false discoveries of genes with eQTLs (eGenes) were substantially inflated when false discovery rate (FDR) control was applied to all tests and only appropriately controlled using hierarchical procedures. All multiple testing correction procedures had low power and inflated FDR for eGenes whose causal SNPs had small allele frequencies using small sample sizes (e.g. frequency <10% in 100 samples), indicating that even moderately low frequency eQTL SNPs (eSNPs) in these studies are enriched for false discoveries. In scenarios with ≥80% power, the top eSNP was the true simulated eSNP 90% of the time, but substantially less frequently for very common eSNPs (minor allele frequencies >25%). Overestimation of eQTL effect sizes, so-called ‘Winner’s Curse’, was common in low and moderate power settings. To address this, we developed a bootstrap method (BootstrapQTL) that led to more accurate effect size estimation. These insights provide a foundation for future eQTL studies, especially those with sampling constraints and subtly different conditions.
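The 'Winner's Curse' described above can be illustrated with a small simulation. The sketch below is not BootstrapQTL and is not the authors' simulation design; it only shows, under assumed values for the true effect, standard error and significance threshold (all hypothetical), why effect sizes that pass a significance filter are biased upward when power is low.

```python
import random

def simulate_winners_curse(true_beta=0.2, se=0.1, z_threshold=3.29,
                           n_reps=20000, seed=1):
    """Draw repeated effect estimates and compare all vs. significant ones.

    All parameter values are illustrative assumptions, not taken from the paper.
    """
    rng = random.Random(seed)
    # Each replicate yields a noisy estimate of the same true effect.
    estimates = [rng.gauss(true_beta, se) for _ in range(n_reps)]
    # Keep only estimates whose z-score passes the significance threshold.
    significant = [b for b in estimates if abs(b / se) > z_threshold]
    mean_sig = sum(significant) / len(significant) if significant else None
    return {
        "mean_all": sum(estimates) / n_reps,       # close to true_beta
        "mean_significant": mean_sig,              # inflated when power is low
        "power": len(significant) / n_reps,
    }

result = simulate_winners_curse()
print(result)
```

Because an estimate must exceed `z_threshold * se` in absolute value to be declared significant, the average of the significant estimates sits above the true effect whenever the true effect is smaller than that cutoff.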

Highlights

  • Genome-wide association studies (GWAS) have identified thousands of genetic variants associated with complex phenotypes [1], and the vast majority of genome-wide significant SNPs are located in non-coding regions [2], making interpretation challenging

  • To assess false discovery rate (FDR) and effect size estimation in expression quantitative trait loci (eQTL) studies under different parameters, we simulated 36 scenarios combining six sample sizes (N = 100, 200, 500, 1000, 2000 and 5000) with six true minor allele frequencies (MAFs) of eQTL SNPs (eSNPs) (MAF = 0.5, 1, 5, 10, 25 and 50%)

  • Each true eGene was simulated to be regulated by one cis-eQTL with a genetic effect size randomly drawn from an empirical distribution based on eQTL analysis of a real dataset [35,36]
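The 36 scenarios in the highlights above are the full cross of six sample sizes and six MAFs. As a minimal sketch, the grid can be enumerated as follows; the `expected_minor_alleles` field is an added illustration (assuming diploid samples) and is not a quantity reported in the text.

```python
from itertools import product

# Values taken from the highlights: six sample sizes and six eSNP MAFs.
SAMPLE_SIZES = [100, 200, 500, 1000, 2000, 5000]
MAFS = [0.005, 0.01, 0.05, 0.10, 0.25, 0.50]

def scenario_grid(sample_sizes=SAMPLE_SIZES, mafs=MAFS):
    """Return one dict per (sample size, MAF) combination."""
    scenarios = []
    for n, maf in product(sample_sizes, mafs):
        scenarios.append({
            "n": n,
            "maf": maf,
            # Assumption: diploid samples, so 2 * n chromosomes in total.
            "expected_minor_alleles": 2 * n * maf,
        })
    return scenarios

grid = scenario_grid()
print(len(grid))   # number of scenarios
print(grid[0])     # smallest sample size, rarest eSNP
```

In the smallest and rarest scenario (N = 100, MAF = 0.5%), only about one copy of the minor allele is expected in the whole sample, which gives some intuition for why low-frequency eSNPs in small studies are hard to detect reliably.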


Summary

Introduction

Genome-wide association studies (GWAS) have identified thousands of genetic variants associated with complex phenotypes [1], and the vast majority of genome-wide significant SNPs are located in non-coding regions [2], making interpretation challenging. As increasing numbers of eQTLs reach statistical significance, the true proportion of false discoveries and the accuracy of their effect size estimates have not yet been well characterized. A seminal early study compared multiple testing correction methods for detecting eQTLs (including Bonferroni correction, false discovery rate (FDR) control and permutation) using HapMap data; however, estimates of FDR and sensitivity are not possible without knowledge of all true eQTLs in the data [8]. Genotype data have typically been simulated with a narrow minor allele frequency (MAF) range assuming Hardy–Weinberg equilibrium (e.g. MAF 30% in [9], 5 and 20% in [10], 40% in [11]), and so have not captured realistic patterns of genetic variation, especially linkage disequilibrium (LD) complexity. eQTL studies have sample sizes of 50 to 1000, with the accessibility of the tissue or condition a major determining factor (Supplementary Table S1). Perhaps the exemplar multiple human tissue resource, the Genotype-Tissue Expression (GTEx) project [15], comprises 44 tissues
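To make the distinction between pooled and hierarchical multiple testing correction concrete, the sketch below contrasts the two on toy p-values. It assumes one simple hierarchical scheme (Bonferroni correction of the best p-value within each gene, followed by Benjamini-Hochberg across genes); the specific procedures evaluated in the study are not described in this excerpt, and the gene names and p-values are hypothetical.

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

def pooled_egenes(pvals_by_gene, alpha=0.05):
    """Apply BH to all SNP-gene tests at once; a gene is called if any test passes."""
    labels, pooled = [], []
    for gene, ps in pvals_by_gene.items():
        labels.extend([gene] * len(ps))
        pooled.extend(ps)
    adj = bh_adjust(pooled)
    return {g for g, q in zip(labels, adj) if q <= alpha}

def hierarchical_egenes(pvals_by_gene, alpha=0.05):
    """Step 1: Bonferroni-correct the minimum p-value within each gene.
    Step 2: apply BH across the resulting gene-level p-values."""
    genes = list(pvals_by_gene)
    gene_level = [min(1.0, min(pvals_by_gene[g]) * len(pvals_by_gene[g]))
                  for g in genes]
    adj = bh_adjust(gene_level)
    return {g for g, q in zip(genes, adj) if q <= alpha}

# Hypothetical cis-test p-values for three genes.
toy = {
    "geneA": [0.0001, 0.2, 0.5, 0.9],
    "geneB": [0.03, 0.4],
    "geneC": [0.6, 0.7, 0.8],
}
print(pooled_egenes(toy), hierarchical_egenes(toy))
```

The key design difference is the unit of inference: the pooled approach controls FDR over SNP-gene tests, whereas the hierarchical approach reduces each gene to a single corrected p-value before controlling FDR over genes.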
