Analysis of human mini-exome sequencing data from Genetic Analysis Workshop 17 using a Bayesian hierarchical mixture model

Julio S Bueno Filho, Kristin J Meyers, Gota Morota, Matthew J Maenner, Corinne D Engelman, Lina M Vera-Cala, Quoc Tran

doi:10.1186/1753-6561-5-s9-s93

Abstract

Next-generation sequencing technologies are rapidly changing the field of genetic epidemiology and enabling exploration of the full allele frequency spectrum underlying complex diseases. Although sequencing technologies have shifted our focus toward rare genetic variants, statistical methods traditionally used in genetic association studies are inadequate for estimating effects of low minor allele frequency variants. For our study, we use the Genetic Analysis Workshop 17 data from 697 unrelated individuals (genotypes for 24,487 autosomal variants from 3,205 genes). We apply a Bayesian hierarchical mixture model to identify genes associated with a simulated binary phenotype using a transformed genotype design matrix weighted by allele frequencies. A Metropolis-Hastings algorithm is used to jointly sample each indicator variable and additive genetic effect pair from its conditional posterior distribution, and the remaining parameters are sampled by Gibbs sampling. This method identified 58 genes with a posterior probability greater than 0.8 of being associated with the phenotype. One of these 58 genes, PIK3C2B, was correctly identified as being associated with affected status based on the simulation process. This project demonstrates the utility of Bayesian hierarchical mixture models using a transformed genotype matrix to detect genes containing rare and common variants associated with a binary phenotype.

Highlights

The past decade of human genetics research has been dominated by Genome-Wide Association Studies (GWAS) and the common disease/common variant hypothesis.
The purpose of this study is to use the Genetic Analysis Workshop 17 (GAW17) mini-exome sequence data to test the application of a Bayesian hierarchical mixture model for identifying genes containing rare and common variants associated with a simulated binomial outcome.
Of the 24,487 single-nucleotide polymorphisms (SNPs) included in the GAW17 data set, we removed 576 SNPs because of their collinearity with another SNP or combination of other SNPs in the same gene.
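
The collinearity filter mentioned in the last highlight can be illustrated with a minimal sketch. The text does not state how collinear SNPs were detected, so the greedy rank check below is an assumption, as are the function name `drop_collinear_columns`, the tolerance, and the toy genotype matrix. Within one gene, a SNP column is kept only if it is not a linear combination of the columns already kept.

```python
import numpy as np

def drop_collinear_columns(genotypes, tol=1e-8):
    """Return the indices of the SNP columns kept after greedily dropping
    every column that is a linear combination of the columns kept so far.
    (Hypothetical helper; not the procedure documented by the authors.)"""
    kept = []
    for j in range(genotypes.shape[1]):
        candidate = genotypes[:, kept + [j]]
        # A column carries new information only if it raises the rank by one.
        if np.linalg.matrix_rank(candidate, tol=tol) == len(kept) + 1:
            kept.append(j)
    return kept

# Toy gene: 6 individuals x 5 SNPs, coded as minor-allele counts (0, 1, 2).
# Column 2 equals column 0 + column 1; column 4 duplicates column 0.
gene = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 2, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0],
])

kept = drop_collinear_columns(gene)
print(kept)
```

In this toy example the third and fifth columns are dropped. Note that a greedy check of this kind depends on column order, and it also drops any all-zero (monomorphic) column.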

Summary

Introduction

The past decade of human genetics research has been dominated by Genome-Wide Association Studies (GWAS) and the common disease/common variant hypothesis. The prevailing statistical approach for estimating genetic effects in GWAS has been to test one SNP at a time for association with the phenotype of interest using linear or logistic regression. This approach is limited in sequencing studies because, by definition, sequencing studies identify rare genetic variants that individually do not provide statistical power for detecting associations. Although methods that pool SNPs increase statistical power for implicating a gene or genomic region, they are limited because they cannot determine which of the pooled SNPs has causal potential and because they ignore complexity by modeling only one gene at a time.
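
The limitation of pooling noted above can be made concrete with a small sketch. It is a generic burden-style collapse, not the model used in this study. The function name `burden_scores`, the minor allele frequency threshold, and the toy genotypes are all assumptions made for illustration.

```python
import numpy as np

def burden_scores(genotypes, maf_threshold):
    """Collapse the rare variants of one gene into a per-individual score.
    Genotypes are assumed to be minor-allele counts (0, 1, 2)."""
    genotypes = np.asarray(genotypes, dtype=float)
    n_individuals = genotypes.shape[0]
    # Sample minor allele frequency: allele count over 2N chromosomes.
    maf = genotypes.sum(axis=0) / (2 * n_individuals)
    rare = maf < maf_threshold
    # The score is the number of rare minor alleles each individual carries.
    return genotypes[:, rare].sum(axis=1), rare

# Toy gene: 6 individuals x 3 SNPs; the first two SNPs are rare singletons.
toy_gene = [
    [1, 0, 1],
    [0, 1, 1],
    [0, 0, 0],
    [0, 0, 1],
    [0, 0, 2],
    [0, 0, 0],
]

# The threshold of 0.1 suits only this tiny toy sample.
scores, rare = burden_scores(toy_gene, maf_threshold=0.1)
print(scores, rare)
```

The first two individuals receive the same score although they carry different rare variants, which is why a pooled score alone cannot indicate which SNP has causal potential.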

Objectives

Methods

Results

Discussion

Conclusion