Abstract
Background: In genome-wide association studies (GWAS), the number of single-nucleotide polymorphisms (SNPs) typically ranges between 500,000 and 1,000,000. Accordingly, detecting gene-gene interactions in GWAS is computationally challenging because it involves hundreds of billions of SNP pairs. Stage-wise strategies are often used to overcome the computational difficulty. In the first stage, fast screening methods (e.g., Tuning ReliefF) are applied to reduce the whole SNP set to a small subset. In the second stage, sophisticated modeling methods (e.g., multifactor-dimensionality reduction (MDR)) are applied to the subset of SNPs to identify interesting interaction models and the corresponding interaction patterns. In the third stage, the significance of the identified interaction patterns is evaluated by hypothesis testing.

Results: In this paper, we show that this stage-wise strategy can fail to control the false positive rate if the null distribution is not appropriately chosen. This is because screening and modeling may change the null distribution used in hypothesis testing. In our simulation study, we use some popular screening methods and the popular modeling method MDR as examples to show the effect of an inappropriate choice of null distribution. To choose appropriate null distributions, we suggest using the permutation test or testing on an independent data set. We demonstrate their performance using synthetic data and a real genome-wide data set from an Age-related Macular Degeneration (AMD) study.

Conclusions: The permutation test or testing on an independent data set can help choose appropriate null distributions in hypothesis testing, which provides more reliable results in practice.
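The permutation idea can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes genotypes coded 0/1/2 and a binary case/control status, uses a marginal chi-square score as a stand-in for a screening method such as Tuning ReliefF, and uses an exhaustive pairwise chi-square as a stand-in for MDR. All function names are hypothetical. The point it shows is that screening and modeling are repeated on every permuted label vector, so the null distribution reflects the selection performed in the first two stages.

```python
from itertools import combinations

import numpy as np


def chi2_stat(codes, y, n_levels):
    """Pearson chi-square statistic for a categorical code versus binary status y."""
    table = np.zeros((n_levels, 2))
    np.add.at(table, (codes, y), 1)          # count samples per (code, status) cell
    table = table[table.sum(axis=1) > 0]     # drop unobserved genotype codes
    table = table[:, table.sum(axis=0) > 0]  # drop an empty status column, if any
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    return float(((table - expected) ** 2 / expected).sum())


def stagewise_best_stat(X, y, keep=5):
    """Stage 1: keep the top SNPs by marginal score. Stage 2: best pairwise statistic."""
    marginal = np.array([chi2_stat(X[:, j], y, 3) for j in range(X.shape[1])])
    kept = np.argsort(marginal)[-keep:]      # assumed screening rule (stand-in)
    return max(
        chi2_stat(3 * X[:, i] + X[:, j], y, 9)  # 9 two-locus genotype combinations
        for i, j in combinations(kept, 2)
    )


def permutation_p_value(X, y, n_perm=200, keep=5, seed=0):
    """Stage 3: compare the observed statistic with a null built by re-running
    the whole pipeline (screening and modeling) on permuted labels."""
    rng = np.random.default_rng(seed)
    observed = stagewise_best_stat(X, y, keep)
    null = np.array(
        [stagewise_best_stat(X, rng.permutation(y), keep) for _ in range(n_perm)]
    )
    return (1 + np.sum(null >= observed)) / (1 + n_perm)


# Toy data (synthetic, for illustration only): 200 samples, 20 SNPs.
rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(200, 20))
y_null = rng.integers(0, 2, size=200)                    # no association
y_signal = ((X[:, 0] > 0) & (X[:, 1] > 0)).astype(int)   # planted two-SNP effect

print("p-value, no association:", permutation_p_value(X, y_null, n_perm=100))
print("p-value, planted effect:", permutation_p_value(X, y_signal, n_perm=100))
```

Comparing the selected pair's statistic with a textbook chi-square distribution would ignore that the pair was chosen as the best among many, which is the kind of mismatch the abstract warns about.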
Highlights
In genome-wide association studies (GWAS), the number of single-nucleotide polymorphisms (SNPs) typically ranges between 500,000 and 1,000,000
Through simulation experiments and an experiment on a real genome-wide data set from an Age-related Macular Degeneration (AMD) study, we demonstrate that the appropriate choice of null distributions leads to more reliable results
Exhaustively searching all pairwise interactions and then evaluating them with cross-validation (e.g., MDR [6]) becomes impractical in GWAS. To make the analysis computationally feasible, a screening method is applied to the whole data set to pre-select a small subset of SNPs
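As a rough check of the scale involved, the number of unordered SNP pairs can be counted directly. The sketch below assumes the 500,000 to 1,000,000 SNP range quoted above; the helper name `n_pairs` is illustrative.

```python
import math


def n_pairs(n_snps):
    """Number of unordered SNP pairs, i.e. n_snps choose 2."""
    return math.comb(n_snps, 2)


for n in (500_000, 1_000_000):
    print(f"{n:,} SNPs -> {n_pairs(n):,} pairs")
```

For 500,000 SNPs this gives 124,999,750,000 pairs, about 1.25 × 10^11, so an exhaustive pairwise search needs on the order of hundreds of billions of tests.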
Summary
In genome-wide association studies (GWAS), the number of single-nucleotide polymorphisms (SNPs) typically ranges between 500,000 and 1,000,000. In the second stage, sophisticated modeling methods (e.g., multifactor-dimensionality reduction (MDR)) are applied to the subset of SNPs to identify interesting interaction models and the corresponding interaction patterns. Many single-locus based methods [2] have been proposed and many susceptibility determinants have been identified [1]. These identified SNPs seem to be insufficient [...]. An issue in applying most of these methods in GWAS is the computational burden [4]. To find pairwise interactions among 500,000 SNPs, we need about 1.25 × 10^11 statistical tests in total. To address this issue, screening approaches [20] have been proposed. The whole process of detecting gene-gene interactions is divided into three stages: screening, modeling, and hypothesis testing.