Hierarchical probabilistic models for multiple gene/variant associations based on next-generation sequencing data.

Dimitrios V Vavoulis, Anna Schuh, Jenny C Taylor, Ziv Bar-Joseph

doi:10.1093/bioinformatics/btx355

Abstract

Motivation: The identification of genetic variants influencing gene expression (known as expression quantitative trait loci or eQTLs) is important in unravelling the genetic basis of complex traits. Detecting multiple eQTLs simultaneously in a population based on paired DNA-seq and RNA-seq assays employs two competing types of models: models which rely on appropriate transformations of RNA-seq data (and are powered by a mature mathematical theory), or count-based models, which represent digital gene expression explicitly, thus rendering such transformations unnecessary. The latter constitutes an immensely popular methodology, which is, however, plagued by mathematical intractability.

Results: We develop tractable count-based models, which are amenable to efficient estimation through the introduction of latent variables and the appropriate application of recent statistical theory in a sparse Bayesian modelling framework. Furthermore, we examine several transformation methods for RNA-seq read counts and we introduce arcsin, logit and Laplace smoothing as preprocessing steps for transformation-based models. Using natural and carefully simulated data from the 1000 Genomes and gEUVADIS projects, we benchmark both approaches under a variety of scenarios, including the presence of noise and violation of basic model assumptions. We demonstrate that an arcsin transformation of Laplace-smoothed data is at least as good as state-of-the-art models, particularly at small samples. Furthermore, we show that an over-dispersed Poisson model is comparable to the celebrated Negative Binomial, but much easier to estimate. These results provide strong support for transformation-based versus count-based (particularly Negative-Binomial-based) models for eQTL mapping.

Availability and implementation: All methods are implemented in the free software eQTLseq: https://github.com/dvav/eQTLseq

Supplementary information: Supplementary data are available at Bioinformatics online.

Highlights

The identification of genetic variants affecting gene expression is an important step in unravelling the genetic basis of complex traits, including diseases (Albert and Kruglyak, 2015; Cookson et al, 2009; Joehanes et al, 2017).
Count data produced by assays such as RNA-seq (Wang et al, 2009) constitute a digital measure of gene expression, making methodologies developed for continuous microarray data not directly applicable.
Identifying gene/variant associations in a population, based on paired RNA-seq and DNA-seq assays, can be formulated in terms of multiple/multivariate regression, where digital gene expression and genotype data play the role of response and explanatory variables, respectively.
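
The regression formulation in the last highlight can be written out as a rough sketch. The notation below is an assumption made for illustration and is not taken from the paper: i indexes samples, j indexes genes, k indexes the M genetic variants, y_{ij} is the (transformed) expression of gene j in sample i, x_{ik} is the genotype of sample i at variant k, and \beta_{kj} is the effect of variant k on gene j. The normal error term applies to the transformation-based setting only; a count-based model would replace it with a discrete likelihood.

```latex
% Illustrative only: symbols are assumptions, not the paper's notation.
\[
  y_{ij} = \beta_{0j} + \sum_{k=1}^{M} x_{ik}\,\beta_{kj} + \varepsilon_{ij},
  \qquad \varepsilon_{ij} \sim \mathcal{N}\!\left(0, \sigma_j^{2}\right)
\]
```

Under this sketch, an eQTL corresponds to a non-zero coefficient \beta_{kj}, and detecting multiple eQTLs simultaneously amounts to estimating all coefficients jointly.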

Summary

Introduction

The identification of genetic variants affecting gene expression (known as expression quantitative trait loci or eQTLs) is an important step in unravelling the genetic basis of complex traits, including diseases (Albert and Kruglyak, 2015; Cookson et al, 2009; Joehanes et al, 2017). Count data produced by assays such as RNA-seq (Wang et al, 2009) constitute a digital measure of gene expression, making methodologies developed for continuous microarray data not directly applicable. A straightforward approach to eQTL mapping using RNA-seq would be to transform digital expression data (Zwiener et al, 2014) and proceed using methodologies developed for microarrays, which usually assume normally distributed data (Bottolo et al, 2011; Cheng et al, 2014; Flutre et al, 2013; Shabalin, 2012; Yi and Xu, 2008). The basic obstacles to applying such methods directly to normalized RNA-seq data are the high degree of skewness, the extreme values and the non-trivial mean-variance relationship that commonly characterize such data. While the aforementioned transformations are not specific to RNA-seq, variance-stabilizing approaches that explicitly model the mean-variance relationship in such data are provided by specialized software, such as DESeq2 (functions rlog and vst) (Love et al, 2014) and limma (function voom) (Law et al, 2014). The practical advantage of using appropriate data transformations is the immediate availability of analytical methods, which (being built around the assumption of normally distributed data) are powered by a tractable mathematical theory.
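
To make the idea of transforming digital expression data concrete, here is a minimal sketch in Python. It is not the eQTLseq implementation. It assumes a genes-by-samples count matrix, add-alpha (Laplace) smoothing of per-sample proportions, and the arcsine square-root form of the arcsin transformation. The function names and the alpha parameter are hypothetical and chosen only for illustration.

```python
import numpy as np


def laplace_smooth(counts, alpha=1.0):
    """Turn a genes x samples count matrix into smoothed per-sample proportions.

    Assumption: add-alpha smoothing, so zero counts map to small positive
    proportions and each sample (column) still sums to one.
    """
    counts = np.asarray(counts, dtype=float)
    n_genes = counts.shape[0]
    totals = counts.sum(axis=0, keepdims=True)  # library size of each sample
    return (counts + alpha) / (totals + alpha * n_genes)


def arcsin_transform(props):
    """Arcsine square-root transform of proportions in [0, 1]."""
    return np.arcsin(np.sqrt(props))


# Toy data: 3 genes (rows) measured in 3 samples (columns).
counts = np.array([[0, 5, 10],
                   [100, 80, 120],
                   [900, 915, 870]])

props = laplace_smooth(counts)
transformed = arcsin_transform(props)
```

The transformed matrix can then be passed to any method that assumes approximately normally distributed responses. Smoothing before the transform keeps zero counts away from the boundary of the proportion scale.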

Methods

Results

Conclusion