Abstract

Background

Genome-wide association studies (GWAS) aim at finding genetic markers that are significantly associated with a phenotype of interest. Single nucleotide polymorphism (SNP) data from the entire genome are collected for many thousands of SNP markers, leading to high-dimensional regression problems where the number of predictors greatly exceeds the number of observations. Moreover, these predictors are statistically dependent, in particular due to linkage disequilibrium (LD).

We propose a three-step approach that explicitly takes advantage of the grouping structure induced by LD in order to identify common variants which may have been missed by single marker analyses (SMA). In the first step, we perform a hierarchical clustering of SNPs with an adjacency constraint using LD as a similarity measure. In the second step, we apply a model selection approach to the obtained hierarchy in order to define LD blocks. Finally, we perform Group Lasso regression on the inferred LD blocks. We investigate the efficiency of this approach compared to state-of-the-art regression methods: haplotype association tests, SMA, and Lasso and Elastic-Net regressions.

Results

Our results on simulated data show that the proposed method performs better than state-of-the-art approaches as soon as the number of causal SNPs within an LD block exceeds 2. Our results on semi-simulated data and a previously published HIV data set illustrate the relevance of the proposed method and its robustness to a real LD structure. The method is implemented in the R package BALD (Blockwise Approach using Linkage Disequilibrium), available from http://www.math-evry.cnrs.fr/publications/logiciels.

Conclusions

Our results show that the proposed method is efficient not only at the level of LD blocks, by accurately inferring the underlying block structure, but also at the level of individual SNPs. Thus, this study demonstrates the importance of tailored integration of biological knowledge in high-dimensional genomic studies such as GWAS.

Electronic supplementary material

The online version of this article (doi:10.1186/s12859-015-0556-6) contains supplementary material, which is available to authorized users.

Highlights

  • The first 20 single nucleotide polymorphisms (SNPs) selected by the Lasso are the same as those selected by the univariate model, except for 3 SNPs; the names of these 3 SNPs are marked with blue dashes (-) in the left panel of Figure 9

  • In this paper, we have proposed a three-step approach that takes into account the biological information carried by the linkage disequilibrium between variables: first inferring LD blocks, then estimating the number of such blocks, and finally performing Group Lasso regression on the inferred groups

  • State-of-the-art single marker analyses (SMA) and penalized regression approaches Lasso and Elastic-Net are outperformed by our proposed method for the purpose of identifying blocks containing causal SNPs
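
The blockwise idea in the highlights above can be illustrated with a minimal Python sketch. It is not the BALD implementation. It makes several assumptions that are not taken from the article: LD is approximated by the squared Pearson correlation between genotype vectors, adjacent blocks are merged greedily by average linkage, the number of blocks `n_blocks` is supplied by hand instead of being chosen by model selection, and the Group Lasso step is reduced to its block-wise soft-thresholding operator with the common sqrt(block size) weighting. All function names are hypothetical.

```python
from math import sqrt

def r2(x, y):
    """Squared Pearson correlation between two genotype vectors (an LD proxy)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic SNP: no LD information
    return sxy * sxy / (sxx * syy)

def adjacent_blocks(snps, n_blocks):
    """Greedy adjacency-constrained clustering: only neighbouring blocks merge.

    snps: one genotype list per SNP, in genomic order.
    Returns a list of blocks, each a list of consecutive SNP indices.
    """
    blocks = [[j] for j in range(len(snps))]

    def link(a, b):
        # Average pairwise r2 between two blocks (assumed linkage criterion).
        return sum(r2(snps[i], snps[j]) for i in a for j in b) / (len(a) * len(b))

    while len(blocks) > n_blocks:
        scores = [link(blocks[k], blocks[k + 1]) for k in range(len(blocks) - 1)]
        k = scores.index(max(scores))  # most similar pair of neighbours
        blocks[k:k + 2] = [blocks[k] + blocks[k + 1]]
    return blocks

def group_soft_threshold(beta, blocks, lam):
    """Block-wise soft-thresholding: the proximal operator of the Group Lasso
    penalty. Each block is shrunk jointly, so a block is either kept or
    zeroed out as a whole."""
    out = list(beta)
    for g in blocks:
        norm = sqrt(sum(beta[j] ** 2 for j in g))
        thresh = lam * sqrt(len(g))
        scale = max(0.0, 1.0 - thresh / norm) if norm > 0 else 0.0
        for j in g:
            out[j] = scale * beta[j]
    return out

# Toy data: SNPs 0-1 are in perfect LD, as are SNPs 2-3.
snps = [
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 1, 2],
    [2, 0, 1, 1, 2, 0],
    [2, 0, 1, 1, 2, 0],
]
blocks = adjacent_blocks(snps, n_blocks=2)  # [[0, 1], [2, 3]]
shrunk = group_soft_threshold([3.0, 4.0, 0.1, 0.1], blocks, lam=1.0)
```

In this toy example the second block has a small joint norm and is removed entirely, while the first block is shrunk but kept. This is the behaviour that lets a group-level penalty select whole LD blocks instead of isolated markers.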


Summary

Introduction

With recent advances in high-throughput genotyping technology, genome-wide association studies (GWAS) have become a tool of choice for identifying genetic markers underlying variation in a given phenotype, typically complex human diseases and traits. The most widely used approach for selecting causal SNPs is to perform univariate tests of association between the phenotype of interest and the genotype of each marker [2,3].
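
As an illustration of what a univariate scan looks like, the following Python sketch scores each marker separately. It assumes a quantitative phenotype, genotypes coded as minor-allele counts (0, 1 or 2), and a simple linear regression t statistic per SNP. The tests used in [2,3] may differ, and the function names are hypothetical.

```python
from math import sqrt

def marginal_t(genotype, phenotype):
    """t statistic for the slope of a simple linear regression of the
    phenotype on one SNP's allele count."""
    n = len(genotype)
    mg, mp = sum(genotype) / n, sum(phenotype) / n
    sgp = sum((g - mg) * (p - mp) for g, p in zip(genotype, phenotype))
    sgg = sum((g - mg) ** 2 for g in genotype)
    spp = sum((p - mp) ** 2 for p in phenotype)
    if sgg == 0 or spp == 0:
        return 0.0  # no variation: the SNP cannot be tested
    r = sgp / sqrt(sgg * spp)
    if abs(r) >= 1.0:
        return float("inf") if r > 0 else float("-inf")
    return r * sqrt((n - 2) / (1.0 - r * r))

def single_marker_scan(snps, phenotype):
    """Test every SNP on its own; return indices ranked by |t| and the statistics."""
    stats = [marginal_t(g, phenotype) for g in snps]
    order = sorted(range(len(snps)), key=lambda j: abs(stats[j]), reverse=True)
    return order, stats

# Toy data: SNP 0 tracks the phenotype, SNP 2 is monomorphic.
snps = [
    [0, 1, 2, 0, 1, 2],
    [2, 0, 1, 1, 2, 0],
    [1, 1, 1, 1, 1, 1],
]
phenotype = [0.1, 1.0, 2.1, -0.1, 0.9, 1.9]
order, stats = single_marker_scan(snps, phenotype)
```

Because every marker is scored in isolation, such a scan ignores the dependence between neighbouring SNPs, and in practice the resulting statistics would still need to be converted to p-values and corrected for multiple testing.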
