Abstract

Background: Transcriptome sequencing (RNA-Seq) has become the assay of choice for high-throughput studies of gene expression. However, as is the case with microarrays, major technology-related artifacts and biases affect the resulting expression measures. Normalization is therefore essential to ensure accurate inference of expression levels and subsequent analyses thereof.

Results: We focus on biases related to GC-content and demonstrate the existence of strong sample-specific GC-content effects on RNA-Seq read counts, which can substantially bias differential expression analysis. We propose three simple within-lane gene-level GC-content normalization approaches and assess their performance on two different RNA-Seq datasets, involving different species and experimental designs. Our methods are compared to state-of-the-art normalization procedures in terms of bias and mean squared error for expression fold-change estimation and in terms of Type I error and p-value distributions for tests of differential expression. The exploratory data analysis and normalization methods proposed in this article are implemented in the open-source Bioconductor R package EDASeq.

Conclusions: Our within-lane normalization procedures, followed by between-lane normalization, reduce GC-content bias and lead to more accurate estimates of expression fold-changes and tests of differential expression. Such results are crucial for the biological interpretation of RNA-Seq experiments, where downstream analyses can be sensitive to the supplied lists of genes.

Highlights

  • Transcriptome sequencing (RNA-Seq) has become the assay of choice for high-throughput studies of gene expression

  • We examine the performance of the proposed normalization methods on two different types of data: a new RNA-Seq dataset for yeast grown in three different media and well-known benchmarking RNA-Seq datasets for two types of human reference samples from the MicroArray Quality Control (MAQC) Project [23]

  • These observations suggest that the GC-content bias is likely to be introduced at the library preparation step

Summary

Introduction

Transcriptome sequencing (RNA-Seq) has become the assay of choice for high-throughput studies of gene expression. As with microarrays, major technology-related artifacts and biases affect the expression measures [3,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20], and normalization remains an important issue, despite initial optimistic claims such as: “One powerful advantage of RNA-Seq is that it can capture transcriptome dynamics across different tissues or […]”

In an RNA-Seq experiment, mRNA is converted to cDNA fragments, which are sequenced to produce millions of short reads (typically 25-100 bases). These reads are mapped back to a reference genome, and the number of reads mapping to a particular gene reflects the abundance of the transcript in the sample of interest. The read count will vary between replicate lanes as a result of differences in sequencing depth, i.e., the total number of reads produced in a given lane.
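To make the sequencing-depth effect concrete, the following minimal Python sketch rescales per-gene read counts so that every lane has the same total. It illustrates plain total-count scaling only; it is not the article's method and not the EDASeq API. The function name `scale_to_common_depth`, the lane labels, and the toy counts are all hypothetical.

```python
from typing import Dict, List


def scale_to_common_depth(counts: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Rescale each lane so that all lanes share the same total read count.

    `counts` maps a lane label to per-gene read counts, with genes in the
    same order in every lane. The common target depth is the mean lane total.
    (Hypothetical helper for illustration; not taken from the article.)
    """
    totals = {lane: sum(values) for lane, values in counts.items()}
    if any(total == 0 for total in totals.values()):
        raise ValueError("every lane needs at least one mapped read")
    target = sum(totals.values()) / len(totals)
    return {
        lane: [value * target / totals[lane] for value in values]
        for lane, values in counts.items()
    }


# Toy data (made up): lane_b was sequenced twice as deeply as lane_a,
# but the relative abundance of the three genes is identical.
lanes = {"lane_a": [10, 30, 60], "lane_b": [20, 60, 120]}
scaled = scale_to_common_depth(lanes)
```

After scaling, both toy lanes carry the same per-gene values, so the remaining differences between real lanes would reflect something other than depth. A correction of this kind acts between lanes and does not address gene-specific effects such as GC-content within a lane.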
