Trimming of sequence reads alters RNA-Seq gene expression estimates.

Claire R Williams, Alyssa Baccarella, Charles C Kim, Jay Z Parrish

doi:10.1186/s12859-016-0956-2


Open Access

https://doi.org/10.1186/s12859-016-0956-2


Abstract

Background: High-throughput RNA-Sequencing (RNA-Seq) has become the preferred technique for studying gene expression differences between biological samples and for discovering novel isoforms, though the techniques to analyze the resulting data are still immature. One pre-processing step that is widely but heterogeneously applied is trimming, in which low quality bases, identified by the probability that they are called incorrectly, are removed. However, the impact of trimming on subsequent alignment to a genome could influence downstream analyses including gene expression estimation; we hypothesized that this might occur in an inconsistent manner across different genes, resulting in differential bias.

Results: To assess the effects of trimming on gene expression, we generated RNA-Seq data sets from four samples of larval Drosophila melanogaster sensory neurons, and used three trimming algorithms (SolexaQA, Trimmomatic, and ConDeTri) to perform quality-based trimming across a wide range of stringencies. After aligning the reads to the D. melanogaster genome with TopHat2, we used Cuffdiff2 to compare the original, untrimmed gene expression estimates to those following trimming. With the most aggressive trimming parameters, over ten percent of genes had significant changes in their estimated expression levels. This trend was seen with two additional RNA-Seq data sets and with alternative differential expression analysis pipelines. We found that the majority of the expression changes could be mitigated by imposing a minimum length filter following trimming, suggesting that the differential gene expression was primarily being driven by spurious mapping of short reads. Slight differences with the untrimmed data set remained after length filtering, which were associated with genes with low exon numbers and high GC content. Finally, an analysis of paired RNA-seq/microarray data sets suggests that no or modest trimming results in the most biologically accurate gene expression estimates.

Conclusions: We find that aggressive quality-based trimming has a large impact on the apparent makeup of RNA-Seq-based gene expression estimates, and that short reads can have a particularly strong impact. We conclude that implementation of trimming in RNA-Seq analysis workflows warrants caution, and if used, should be used in conjunction with a minimum read length filter to minimize the introduction of unpredictable changes in expression estimates.

Electronic supplementary material: The online version of this article (doi:10.1186/s12859-016-0956-2) contains supplementary material, which is available to authorized users.
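To make the two steps named in the abstract concrete, here is a minimal Python sketch of quality-based trimming followed by a minimum length filter. It is an illustration only and does not reproduce the algorithms of SolexaQA, Trimmomatic, or ConDeTri. It assumes Phred-scaled qualities, where Q = -10 log10(p) and p is the probability that a base call is wrong. The function names, the simple 3'-end trimming rule, and the threshold values in the usage note are all hypothetical.

```python
from typing import List, Optional, Tuple


def phred_to_error_prob(q: int) -> float:
    """Convert a Phred quality score Q to the probability of a wrong base call."""
    return 10 ** (-q / 10)


def trim_3prime(seq: str, quals: List[int], min_q: int) -> Tuple[str, List[int]]:
    """Drop bases from the 3' end until a base meets the quality threshold.

    Simplified rule chosen for illustration; real trimmers use more
    elaborate criteria (sliding windows, longest high-quality segment, etc.).
    """
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


def trim_and_filter(
    seq: str, quals: List[int], min_q: int, min_len: int
) -> Optional[Tuple[str, List[int]]]:
    """Trim a read, then discard it (return None) if it is shorter than min_len."""
    trimmed_seq, trimmed_quals = trim_3prime(seq, quals, min_q)
    if len(trimmed_seq) < min_len:
        return None  # too short after trimming: more likely to map spuriously
    return trimmed_seq, trimmed_quals
```

For example, with the hypothetical read `"ACGTACGTAC"` and qualities `[30, 30, 30, 30, 30, 30, 10, 30, 5, 2]`, calling `trim_and_filter(seq, quals, min_q=20, min_len=5)` removes the last two bases and keeps the read, while `min_len=9` discards it. The length filter is the step the abstract reports as mitigating most of the expression changes.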

Highlights

High-throughput RNA-Sequencing (RNA-Seq) has become the preferred technique for studying gene expression differences between biological samples and for discovering novel isoforms, though the techniques to analyze the resulting data are still immature
Quality-based trimming of ultralow-input RNA-Seq data increases mappability. Previous work has shown that quality-based trimming of RNA-Seq data can lead to greatly increased mappability of reads [6]
Imposing minimum read length requirements reverts gene expression estimates to values closer to estimates produced from untrimmed reads, suggesting that untrimmed or trimmed, length-filtered reads—the latter of which likely represent the highest quality reads within a data set—may most accurately reflect the actual library composition

Summary

Introduction

High-throughput RNA-Sequencing (RNA-Seq) has become the preferred technique for studying gene expression differences between biological samples and for discovering novel isoforms, though the techniques to analyze the resulting data are still immature. The impact of trimming on subsequent alignment to a genome could influence downstream analyses including gene expression estimation; we hypothesized that this might occur in an inconsistent manner across different genes, resulting in differential bias. RNA sequencing (RNA-Seq) has supplanted microarrays as the preferred technique for gene expression analysis. One initial step of RNA-Seq analysis is to evaluate sequence read quality, which can vary substantially based on factors related to nucleic acid library preparation (e.g., adapter contamination, polymerase errors) and sequencing (e.g., cluster density, optical detection errors, phasing errors) [1]. Errors have a tendency to co-occur, such that reads with two errors are more common than would be predicted based on a model in which errors occur independently of one another [3].

Williams et al. BMC Bioinformatics (2016) 17:103
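The independence baseline mentioned in the introduction can be written down explicitly. If every base in a read of length n is miscalled independently with the same probability p, the number of errors per read follows a binomial distribution, and the claim is that the observed count of two-error reads exceeds this prediction. The sketch below computes that baseline; the function names are hypothetical, and any read length, error rate, or read count passed in is an illustrative assumption rather than a value from the paper.

```python
from math import comb


def prob_k_errors(read_length: int, per_base_error: float, k: int) -> float:
    """Binomial probability of exactly k errors in one read,
    assuming errors at different positions are independent."""
    return (
        comb(read_length, k)
        * per_base_error ** k
        * (1 - per_base_error) ** (read_length - k)
    )


def expected_reads_with_k_errors(
    n_reads: int, read_length: int, per_base_error: float, k: int
) -> float:
    """Expected number of reads carrying exactly k errors under independence."""
    return n_reads * prob_k_errors(read_length, per_base_error, k)
```

Comparing `expected_reads_with_k_errors(n_reads, read_length, per_base_error, 2)` with the number of two-error reads actually observed is one way to see the excess described above.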



Journal: BMC Bioinformatics	Publication Date: Feb 25, 2016
Citations: 175	License type: CC BY 4.0

