Exploration of a diversity of computational and statistical measures of association for genome-wide genetic studies

Elisabetta Manduchi,Jason H Moore,Marylyn D Ritchie,Patryk R Orzechowski

doi:10.1186/s13040-019-0201-4

Open Access

Journal: BioData Mining
Publication Date: Jul 9, 2019
Citations: 3
License type: open-access

Affiliation: University of Pennsylvania

Abstract

Background: The principal line of investigation in Genome Wide Association Studies (GWAS) is the identification of main effects, that is, individual Single Nucleotide Polymorphisms (SNPs) which are associated with the trait of interest, independent of other factors. A variety of methods have been proposed to this end, mostly statistical in nature and differing in assumptions and type of model employed. Moreover, for a given model, there may be multiple choices for the SNP genotype encoding. As an alternative to statistical methods, machine learning methods are often applicable. Typically, for a given GWAS, a single approach is selected and utilized to identify potential SNPs of interest. Even when multiple GWAS are combined through meta-analyses within a consortium, each GWAS is typically analyzed with a single approach and the resulting summary statistics are then utilized in meta-analyses.

Results: In this work we use as case studies a Type 2 Diabetes (T2D) and a breast cancer GWAS to explore a diversity of applicable approaches spanning different methods and encoding choices. We assess similarity of these approaches based on the derived ranked lists of SNPs and, for each GWAS, we identify a subset of representative approaches that we use as an ensemble to derive a union list of top SNPs. Among these are SNPs which are identified by multiple approaches as well as several SNPs identified by only one or a few of the less frequently used approaches. The latter include SNPs from established loci and SNPs which have other supporting lines of evidence in terms of their potential relevance to the traits.

Conclusions: Not every main effect analysis method is suitable for every GWAS, but for each GWAS there are typically multiple applicable methods and encoding options. We suggest a workflow for a single GWAS, extensible to multiple GWAS from consortia, where representative approaches are selected among a pool of suitable options, to yield a more comprehensive set of SNPs, potentially including SNPs that would typically be missed with the most popular analyses, but that could provide additional valuable insights for follow-up.

Highlights

Genome Wide Association Studies (GWAS) have yielded many valuable insights into the genetic bases of common diseases and complex traits.
In this work we focus on binary trait GWAS and, using a Type 2 Diabetes (T2D) GWAS and a breast cancer GWAS as case studies, we explore 25 applicable approaches spanning different methods, their software implementations, and encoding choices.
We identify a representative collection of approaches that we use as an ensemble to derive a union list of top single nucleotide polymorphisms (SNPs) for each GWAS.

Summary

Introduction

GWAS have yielded many valuable insights into the genetic bases of common diseases and complex traits. Several statistical methods have been proposed for univariate GWAS analyses, as reviewed in [6]. Some of these require a choice of genotype encoding, and the prevailing choice is additive encoding, where the genotype of an individual at a SNP is represented by 0, 1, or 2 to indicate the number of non-reference alleles. Machine learning has seen little use for univariate associations, since it is mostly applied to interactions or other complex effects, but methods such as MDR [7], entropy-based measures [8], and decision trees [9] could be employed to detect main effects. Even when multiple GWAS are combined through meta-analyses within a consortium, each GWAS is typically analyzed with a single approach and the resulting summary statistics are then utilized in meta-analyses.
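To make the notion of genotype encoding concrete, the following is a minimal Python sketch of the additive encoding described above. For contrast it also shows the standard dominant and recessive encodings; these are included here as an assumption for illustration and are not necessarily the alternatives considered by the authors. The function names, the allele labels, and the choice of "A" as the reference allele are all hypothetical.

```python
# Illustrative sketch only: function names and allele labels are hypothetical.
# A genotype is the pair of alleles an individual carries at a biallelic SNP.

def additive(genotype, ref):
    """Additive encoding: number of non-reference alleles (0, 1, or 2)."""
    return sum(1 for allele in genotype if allele != ref)

def dominant(genotype, ref):
    """Dominant encoding (assumed alternative): 1 if any non-reference allele is present."""
    return 1 if additive(genotype, ref) >= 1 else 0

def recessive(genotype, ref):
    """Recessive encoding (assumed alternative): 1 only if both alleles are non-reference."""
    return 1 if additive(genotype, ref) == 2 else 0

# Three individuals at one SNP, with "A" taken as the reference allele.
genotypes = [("A", "A"), ("A", "G"), ("G", "G")]
ref = "A"

additive_codes = [additive(g, ref) for g in genotypes]
dominant_codes = [dominant(g, ref) for g in genotypes]
recessive_codes = [recessive(g, ref) for g in genotypes]

print(additive_codes)   # [0, 1, 2]
print(dominant_codes)   # [0, 1, 1]
print(recessive_codes)  # [0, 0, 1]
```

The same three genotypes yield different numeric predictors under each encoding, which is one reason a single method applied with different encodings can rank SNPs differently.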
