Abstract

Motivation: Next-generation sequencing techniques revolutionized the study of RNA expression by permitting whole transcriptome analysis. However, sequencing reads generated from nested and multi-copy genes are often either misassigned or discarded, which greatly reduces both quantification accuracy and gene coverage.

Results: Here we present count corrector (CoCo), a read assignment pipeline that takes into account the multitude of overlapping and repetitive genes in the transcriptome of higher eukaryotes. CoCo uses a modified annotation file that highlights nested genes and proportionally distributes multimapped reads between repeated sequences. CoCo salvages over 15% of discarded aligned RNA-seq reads and significantly changes the abundance estimates for both coding and non-coding RNA as validated by PCR and bedgraph comparisons.

Availability and implementation: The CoCo software is an open source package written in Python and available from http://gitlabscottgroup.med.usherbrooke.ca/scott-group/coco.

Supplementary information: Supplementary data are available at Bioinformatics online.

Highlights

  • Detection and quantification of RNA transcripts is a critical step to understand the mechanism of gene expression and its impact on cell function

  • We have developed the Count Corrector (CoCo) package, which consists of three main modules: 1) the correct_annotation module, which generates gapped annotation files in which the regions of host gene transcript features that overlap nested genes are precisely removed (Fig. 1B); 2) the correct_count module, which recovers the reads associated with nested and multimapped genes using the modified annotation (Fig. 1D and E); and 3) the correct_bedgraph module, which produces accurate representations of paired-end reads (Supplementary Fig. 2)

  • To test the quantification accuracy of the CoCo pipeline, we examined its capacity to correctly assign and quantify sequencing reads using four RNA-sequencing (RNA-seq) datasets, and compared its quantification to that of the main read assignment pipelines available
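
The two ideas behind the correct_annotation and correct_count modules can be sketched in a few lines of Python. This is a minimal illustration and not CoCo's actual code: the function names, the half-open coordinate convention, the use of uniquely mapped read counts as the weights for the proportional split, and the even split used when no unique counts exist are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end); a convention assumed here


def subtract_intervals(host: Interval, nested: List[Interval]) -> List[Interval]:
    """Return the parts of a host feature not covered by any nested gene.

    Hypothetical helper illustrating a "gapped" annotation: the host
    feature is split around each nested gene that overlaps it.
    """
    pieces: List[Interval] = []
    cursor, host_end = host
    for start, end in sorted(nested):
        if end <= cursor or start >= host_end:
            continue  # no overlap with what remains of the host feature
        if start > cursor:
            pieces.append((cursor, start))  # keep the region before the nested gene
        cursor = max(cursor, end)  # skip past the nested gene
    if cursor < host_end:
        pieces.append((cursor, host_end))  # keep the tail of the host feature
    return pieces


def distribute_multimapped(
    unique_counts: Dict[str, float], genes: List[str], n_reads: float
) -> Dict[str, float]:
    """Split n_reads multimapped reads between the genes they align to.

    Assumption: shares are proportional to each gene's uniquely mapped
    read count, with an even split when none of the genes has unique reads.
    """
    total = sum(unique_counts.get(gene, 0) for gene in genes)
    if total == 0:
        return {gene: n_reads / len(genes) for gene in genes}
    return {gene: n_reads * unique_counts.get(gene, 0) / total for gene in genes}


if __name__ == "__main__":
    # A host exon spanning 100-500 that contains two nested genes.
    print(subtract_intervals((100, 500), [(200, 250), (400, 450)]))
    # Eight multimapped reads shared by two copies of a repeated gene.
    print(distribute_multimapped({"copyA": 30, "copyB": 10}, ["copyA", "copyB"], 8))
```

With gapped host features, a read that falls entirely inside a nested gene no longer overlaps two features and can be assigned unambiguously, while the proportional split keeps multimapped reads in the counts instead of discarding them.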

Summary

Introduction

Detection and quantification of RNA transcripts is a critical step in understanding the mechanism of gene expression and its impact on cell function. Diverse library preparation protocols exist, the most commonly used ones focusing on particular classes of RNA through enrichment steps. Such strategies include polyA enrichment, non-rRNA enrichment (e.g. rRNA depletion), small RNA enrichment and enrichment for RNAs bound to specific factors (Conesa et al., 2016; Hrdlickova et al., 2017; O'Neil et al., 2013).

In the case of the CH507-513H4.1 locus, which hosts the miRNAs miR-3648 and miR-3687, the reads were originally attributed to the miRNAs despite the absence of corresponding peaks in the bedgraph. This inappropriate assignment is no longer observed following background correction (Supplementary Fig. 9).

