A flexible count data model to fit the wide diversity of expression profiles arising from extensively replicated RNA-seq experiments

Mikel Esnaola,Robert Castelo,Pedro Puig,David Gonzalez,Juan R Gonzalez

doi:10.1186/1471-2105-14-254

Abstract

Background: High-throughput RNA sequencing (RNA-seq) offers unprecedented power to capture the real dynamics of gene expression. Experimental designs with extensive biological replication present a unique opportunity to exploit this feature and distinguish expression profiles with higher resolution. RNA-seq data analysis methods so far have been mostly applied to data sets with few replicates and their default settings try to provide the best performance under this constraint. These methods are based on two well-known count data distributions: the Poisson and the negative binomial. The way to properly calibrate them with large RNA-seq data sets is not trivial for the non-expert bioinformatics user.

Results: Here we show that expression profiles produced by extensively-replicated RNA-seq experiments lead to a rich diversity of count data distributions beyond the Poisson and the negative binomial, such as Poisson-Inverse Gaussian or Pólya-Aeppli, which can be captured by a more general family of count data distributions called the Poisson-Tweedie. The flexibility of the Poisson-Tweedie family enables a direct fitting of emerging features of large expression profiles, such as heavy-tails or zero-inflation, without the need to alter a single configuration parameter. We provide a software package for R called tweeDEseq implementing a new test for differential expression based on the Poisson-Tweedie family. Using simulations on synthetic and real RNA-seq data we show that tweeDEseq yields P-values that are equally or more accurate than competing methods under different configuration parameters. By surveying the tiny fraction of sex-specific gene expression changes in human lymphoblastoid cell lines, we also show that tweeDEseq accurately detects differentially expressed genes in a real large RNA-seq data set with improved performance and reproducibility over the previously compared methodologies. Finally, we compared the results with those obtained from microarrays in order to check for reproducibility.

Conclusions: RNA-seq data with many replicates leads to a handful of count data distributions which can be accurately estimated with the statistical model illustrated in this paper. This method provides a better fit to the underlying biological variability; this may be critical when comparing groups of RNA-seq samples with markedly different count data distributions. The tweeDEseq package forms part of the Bioconductor project and it is available for download at http://www.bioconductor.org.
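
The departure from a Poisson model described in the abstract can be checked with two simple summary statistics. The following is a minimal sketch in Python, not the tweeDEseq implementation: the function names `dispersion_index` and `excess_zero_fraction` and the example counts are hypothetical and chosen only for illustration. A Poisson profile has a variance-to-mean ratio near 1 and a zero fraction near exp(-mean); clearly larger values hint at overdispersion, heavy tails or zero-inflation.

```python
import math
from statistics import mean, variance


def dispersion_index(counts):
    """Sample variance divided by sample mean (about 1 for Poisson counts)."""
    m = mean(counts)
    if m == 0:
        raise ValueError("dispersion index is undefined when all counts are zero")
    return variance(counts) / m


def excess_zero_fraction(counts):
    """Observed fraction of zeros minus the fraction exp(-mean) a Poisson predicts."""
    observed = sum(1 for c in counts if c == 0) / len(counts)
    return observed - math.exp(-mean(counts))


# Hypothetical expression profiles for one gene across 10 samples (both have mean 5).
near_poisson = [1, 5, 9, 4, 6, 3, 7, 5, 3, 7]
heavy_tailed = [0, 0, 1, 0, 2, 0, 0, 1, 0, 46]

for name, profile in [("near_poisson", near_poisson), ("heavy_tailed", heavy_tailed)]:
    print(name, round(dispersion_index(profile), 2), round(excess_zero_fraction(profile), 2))
```

Both profiles share the same mean, so a model with a single parameter could not tell them apart; the second profile needs extra parameters for its spread and its zeros, which is the kind of flexibility the abstract attributes to the Poisson-Tweedie family.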

Highlights

High-throughput RNA sequencing (RNA-seq) offers unprecedented power to capture the real dynamics of gene expression
We provide data supporting the hypothesis that the lack of fit to negative binomial (NB) distributions may be related to the dynamics of gene expression unveiled by RNA-seq technology
We demonstrate with simulations on synthetic and real RNA-seq data how a single run of our approach provides P-values that are equally or more accurate than NB-based competing methods calibrated with a variety of configuration parameters

Summary

Introduction

High-throughput RNA sequencing (RNA-seq) offers unprecedented power to capture the real dynamics of gene expression. For differential expression (DE) analysis, after some pre-processing steps that include the alignment of the sequenced reads to a reference genome and their summarization into features of interest (e.g., genes), raw RNA-seq data is transformed into an initial table of counts. This table should be normalized [2,3,4] in order to adjust for both technical variability and the expression properties of the samples, such that the estimated normalization factors and offsets applied to the RNA-seq count data describe as accurately as possible the relative number of copies of each feature throughout every sample. As opposed to the continuous nature of log-scale fluorescence units in microarray data, RNA-seq expression levels are defined by discrete count data and require specific DE analysis techniques.
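
As a rough illustration of what normalizing a table of counts means, the sketch below rescales each sample by its library size to counts per million (CPM). This is the simplest possible scaling and is only an assumption for illustration; it is not necessarily one of the methods cited in [2,3,4], and the function name, gene names, and counts are hypothetical.

```python
def counts_per_million(table):
    """Scale each sample (column) so that its counts sum to one million.

    `table` maps gene name -> list of raw counts, one entry per sample.
    """
    n_samples = len(next(iter(table.values())))
    # Library size = total number of counted reads in each sample.
    library_sizes = [sum(row[j] for row in table.values()) for j in range(n_samples)]
    if any(size == 0 for size in library_sizes):
        raise ValueError("every sample needs at least one counted read")
    return {
        gene: [1e6 * count / size for count, size in zip(row, library_sizes)]
        for gene, row in table.items()
    }


# Hypothetical raw counts: 3 genes x 2 samples; sample 2 was sequenced twice as deeply.
raw = {
    "geneA": [10, 20],
    "geneB": [30, 60],
    "geneC": [60, 120],
}
cpm = counts_per_million(raw)
for gene, values in cpm.items():
    print(gene, [round(v) for v in values])
```

In this toy table the second sample has exactly twice the counts of the first, so after scaling both samples report the same value for every gene: the difference in sequencing depth is removed while the relative abundance of each gene is kept.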

Journal: BMC Bioinformatics	Publication Date: Aug 21, 2013
Citations: 54	License type: cc-by
