Abstract

A main application of mRNA sequencing (mRNAseq) is determining lists of differentially expressed genes (DEGs) between two or more conditions. Several software packages exist to produce DEGs from mRNAseq data, but they typically yield different DEGs, sometimes markedly so. The underlying probability model used to describe mRNAseq data is central to deriving DEGs, and, unsurprisingly, most software packages use different models and assumptions to analyze mRNAseq data. Here, we propose a mechanistic justification for modeling mRNAseq as a binomial process, with data from technical replicates given by a binomial distribution and data from biological replicates well described by a beta-binomial distribution. We demonstrate good agreement of this model with two large datasets. We show that an emergent feature of the beta-binomial distribution, in parameter regimes typical for mRNAseq experiments, is the well-known quadratic polynomial scaling of variance with the mean. The so-called dispersion parameter controls this scaling, and our analysis suggests that the dispersion parameter is a monotonically decreasing function of the mean, as opposed to current approaches that impose an asymptotic value on the dispersion parameter at moderate mean read counts. We show how this leads current approaches to overestimate variance for moderately to highly expressed genes, which inflates false negative rates. Describing mRNAseq data with a beta-binomial distribution may thus be preferred, since its parameters are relatable to the mechanistic underpinnings of the technique and may improve the consistency of DEG analysis across software packages, particularly for moderately to highly expressed genes.
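The quadratic mean–variance relation mentioned above can be checked numerically. For a beta-binomial with library size n and Beta(α, β) success probability, the exact variance is n·p·(1−p)·[1 + (n−1)ρ], where p = α/(α+β) and ρ = 1/(α+β+1). In the mRNAseq-typical regime of very large n and very small p, this reduces to approximately μ + φμ² with φ ≈ 1/α. The sketch below uses illustrative parameter values (not taken from the paper) to show the agreement:

```python
def betabin_mean_var(n, alpha, beta):
    # Exact moments of a beta-binomial(n, alpha, beta) distribution:
    # a binomial whose success probability is itself Beta(alpha, beta).
    p = alpha / (alpha + beta)
    rho = 1.0 / (alpha + beta + 1.0)   # overdispersion (intra-class correlation)
    mu = n * p
    var = n * p * (1.0 - p) * (1.0 + (n - 1.0) * rho)
    return mu, var

# Regime typical of mRNAseq: very large library size n, tiny per-gene
# read probability p.  Illustrative numbers only.
n = 10_000_000
target_mean = 500.0
for alpha in (0.5, 2.0, 8.0, 32.0):
    beta_ = alpha * (n / target_mean - 1.0)  # fixes the mean at ~500 reads
    mu, var = betabin_mean_var(n, alpha, beta_)
    phi = 1.0 / alpha                        # emergent dispersion parameter
    print(f"alpha={alpha:5.1f}  mu={mu:7.1f}  exact var={var:12.1f}  "
          f"mu + phi*mu^2 = {mu + phi * mu * mu:12.1f}")
```

The exact variance and the quadratic approximation agree to well within 1% across these settings, and the dispersion φ tracks 1/α rather than a fixed asymptotic value.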

Highlights

  • Since the advent of the microarray around the turn of the 20th century, whole transcriptome profiling has been of great importance to systems biology [1,2,3,4,5,6,7,8]

  • The mRNA samples are converted into a library that is compatible with the sequencing platform

  • To demonstrate explicitly how overestimating dispersion could lead to identification of new differentially expressed genes (DEGs), we explored a comparison of treated vs. control data for the unique molecular identifier (UMI) dataset (DMSO vs. sorafenib) and the Gierlinski dataset (WT vs. Δsnf2)

Introduction

Since the advent of the microarray around the turn of the 20th century, whole transcriptome profiling has been of great importance to systems biology [1,2,3,4,5,6,7,8]. The ability to observe how every transcript in a cell population responds to, for example, treatment with a drug or a change in the expression of a gene-of-interest gives insight into the wiring and function of biological systems. A common method for deriving biological knowledge from such perturbation experiments is to identify lists of differentially expressed transcripts or genes (DEGs) between conditions. The centralized collection of most transcriptome experiments in databases such as the Gene Expression Omnibus (GEO) and the Connectivity Map (CMAP) has given further insight by enabling the use of big data methods to identify general trends and connections that do not emerge from a single experiment (or even a handful) [14,15,16].
