Abstract

To facilitate whole-genome association studies (WGAS), several high-density SNP genotyping arrays have been developed. Genetic coverage and statistical power are the primary benchmark metrics in evaluating the performance of SNP arrays. Ideally, such evaluations would be done on a SNP set and a cohort of individuals that are both independently sampled from the original SNPs and individuals used in developing the arrays. Without utilization of an independent test set, previous estimates of genetic coverage and statistical power may be subject to an overfitting bias. Additionally, the SNP arrays' statistical power in WGAS has not been systematically assessed on real traits. One robust setting for doing so is to evaluate statistical power on thousands of traits measured from a single set of individuals. In this study, 359 newly sampled Americans of European descent were genotyped using both Affymetrix 500K (Affx500K) and Illumina 650Y (Ilmn650K) SNP arrays. From these data, we were able to obtain estimates of genetic coverage, which are robust to overfitting, by constructing an independent test set from among these genotypes and individuals. Furthermore, we collected liver tissue RNA from the participants and profiled these samples on a comprehensive gene expression microarray. The RNA levels were used as a large-scale set of quantitative traits to calibrate the relative statistical power of the commercial arrays. Our genetic coverage estimates are lower than previous reports, providing evidence that previous estimates may be inflated due to overfitting. The Ilmn650K platform showed reasonable power (50% or greater) to detect SNPs associated with quantitative traits when the signal-to-noise ratio (SNR) is greater than or equal to 0.5 and the causal SNP's minor allele frequency (MAF) is greater than or equal to 20% (N = 359). 
In testing each of the more than 40,000 gene expression traits for association with each of the SNPs on the Ilmn650K and Affx500K arrays, we found that the Ilmn650K yielded 15% more discoveries than the Affx500K at the same false discovery rate (FDR) level.

Highlights

  • It has been estimated that the human genome contains more than 5 million common single nucleotide polymorphisms (SNPs) with minor allele frequencies (MAF) ≥10% [1,2,3], and 7.5 million common SNPs with MAF ≥5% [4]

  • Genetic coverage and statistical power are the two key properties on which the arrays are evaluated

  • The RNA levels were used as a large-scale set of quantitative traits to calibrate the relative statistical power of the commercial arrays


Introduction

It has been estimated that the human genome contains more than 5 million common single nucleotide polymorphisms (SNPs) with minor allele frequencies (MAF) ≥10% [1,2,3], and 7.5 million common SNPs with MAF ≥5% [4]. These SNPs may account for the genetic risk of many common human disorders. High-density SNP arrays have been introduced to allow researchers to conduct whole-genome association studies (WGAS). These SNP array platforms are often benchmarked by their genetic coverage and statistical power [4,5]. Genetic coverage of an array platform is defined as the fraction of common SNPs (MAF ≥5%) exceeding a predefined correlation threshold with at least one SNP typed by the array. Statistical power in this setting measures the likelihood to detect a statistically significant association between a truly associated SNP marker and a trait.
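
The definition of genetic coverage given above can be made concrete with a small sketch. This is an illustration under stated assumptions, not the procedure used in the study: the function names, the toy genotypes, and the SNP identifiers are hypothetical; correlation is measured as the squared Pearson correlation between 0/1/2 genotype counts (a common stand-in for haplotype-based r²); and the threshold of 0.8 is an assumed value for the "predefined correlation threshold". A real analysis would typically also restrict comparisons to SNPs within a genomic window, which is omitted here.

```python
from typing import Dict, Iterable, Sequence

Genotypes = Sequence[int]  # minor-allele counts (0, 1, or 2), one per individual


def maf(genotypes: Genotypes) -> float:
    """Minor allele frequency estimated from 0/1/2 allele counts."""
    p = sum(genotypes) / (2 * len(genotypes))
    return min(p, 1 - p)


def r_squared(x: Genotypes, y: Genotypes) -> float:
    """Squared Pearson correlation between two genotype vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) * (a - mx) for a in x)
    syy = sum((b - my) * (b - my) for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP carries no correlation information
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return (sxy * sxy) / (sxx * syy)


def genetic_coverage(
    reference: Dict[str, Genotypes],
    array_snps: Iterable[str],
    r2_threshold: float = 0.8,   # assumed value; the text only says "predefined"
    maf_threshold: float = 0.05,  # "common" SNPs: MAF >= 5%
) -> float:
    """Fraction of common reference SNPs tagged by at least one array SNP."""
    typed = [s for s in array_snps if s in reference]
    common = [s for s, g in reference.items() if maf(g) >= maf_threshold]
    if not common:
        return 0.0
    covered = sum(
        1
        for s in common
        if any(r_squared(reference[s], reference[t]) >= r2_threshold for t in typed)
    )
    return covered / len(common)


# Hypothetical toy reference panel: 12 individuals, 4 SNPs.
reference = {
    "rs_a": [0, 1, 2, 1, 0, 2, 1, 0, 1, 2, 0, 1],  # typed on the array
    "rs_b": [0, 1, 2, 1, 0, 2, 1, 0, 1, 2, 0, 1],  # perfect proxy of rs_a
    "rs_c": [2, 0, 1, 0, 2, 1, 1, 0, 0, 1, 2, 1],  # weakly correlated with rs_a
    "rs_d": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],  # MAF = 1/24, not common
}
array_snps = ["rs_a"]

print(f"{genetic_coverage(reference, array_snps):.2f}")  # prints 0.67
```

In this toy panel, three SNPs are common; rs_a covers itself and rs_b, while rs_c is not tagged, so coverage is 2/3. Evaluating the same function on SNPs and individuals that played no part in choosing the array content is what separates an independent-test-set estimate from one that may be inflated by overfitting.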
