Abstract

Motivation

Long-read, single-molecule sequencing platforms hold great potential for isoform discovery and characterization of multi-exon transcripts. However, their high error rates are an obstacle to distinguishing novel transcript isoforms from sequencing artifacts. Therefore, we developed the package TranscriptClean to correct mismatches, microindels and noncanonical splice junctions in mapped transcripts using the reference genome while preserving known variants.

Results

Our method corrects nearly all mismatches and indels present in a publicly available human PacBio Iso-seq dataset, and rescues 39% of noncanonical splice junctions.

Availability and implementation

All Python and R scripts used in this paper are available at https://github.com/dewyman/TranscriptClean.

Highlights

  • Conventional short-read RNA sequencing is widely used to quantify gene expression in a variety of applications

  • TAPIS and SQANTI deal with remaining errors by removing affected transcripts, the former using a splice junction quality filter, and the latter using a random forest classifier. While these methods produce cleaner Pacific Biosciences (PacBio) datasets, neither attempts to correct noncanonical splice junctions arising from microindel errors

  • We present TranscriptClean, a program that uses the reference genome, splice annotation and a variant file to correct mismatches, microindels and noncanonical splice junctions in PacBio transcripts while preserving known variants


Summary

Introduction

Conventional short-read RNA sequencing is widely used to quantify gene expression in a variety of applications. Circular consensus correction and read polishing steps in the PacBio ToFU analysis pipeline can substantially reduce the error rate for most transcripts once raw reads are processed (Eid, 2009; Gordon, 2015). This correction process is only effective when multiple sequencing passes over the same insert molecule are available, which becomes less likely as transcript length increases (Rhoads and Au, 2015). TAPIS and SQANTI deal with remaining errors by removing affected transcripts, the former using a splice junction quality filter, and the latter using a random forest classifier. While these methods produce cleaner PacBio datasets, neither attempts to correct noncanonical splice junctions arising from microindel errors. In noncanonical splice junction (NCSJ) correction, each NCSJ is compared with high-confidence splice junctions (derived from same-sample mapped short RNA-seq reads or a reference annotation) and is changed to match the known junction when the distance between the NCSJ and its nearest high-confidence junction is microindel-sized.
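The junction-matching idea described above can be illustrated with a minimal Python sketch. This is not the TranscriptClean implementation: the function names (`snap_position`, `correct_junction`), the 5 bp threshold standing in for "microindel-sized", and the choice to handle the two splice sites independently are all assumptions made for illustration.

```python
from bisect import bisect_left

# Assumed threshold for "microindel-sized" (in bp); illustrative only.
MAX_OFFSET = 5


def snap_position(position, known_sorted, max_offset=MAX_OFFSET):
    """Return the nearest known splice site if it lies within max_offset,
    otherwise return the position unchanged.

    known_sorted must be a sorted list of high-confidence coordinates.
    """
    if not known_sorted:
        return position
    i = bisect_left(known_sorted, position)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = known_sorted[max(i - 1, 0):i + 1]
    nearest = min(candidates, key=lambda k: abs(k - position))
    if abs(nearest - position) <= max_offset:
        return nearest
    return position


def correct_junction(junction, known_starts, known_ends, max_offset=MAX_OFFSET):
    """Correct an (intron_start, intron_end) pair against known splice sites.

    Simplifying assumption: each side is snapped independently.
    """
    start, end = junction
    return (
        snap_position(start, known_starts, max_offset),
        snap_position(end, known_ends, max_offset),
    )


if __name__ == "__main__":
    known_starts = [100, 500, 900]
    known_ends = [200, 650, 1200]
    # A junction shifted by a few bases, as a microindel near the splice site might cause.
    print(correct_junction((103, 198), known_starts, known_ends))
    # A junction far from any known site is left untouched.
    print(correct_junction((300, 400), known_starts, known_ends))
```

Keeping the known sites in sorted lists lets each lookup use binary search, which matters when millions of junctions are checked against an annotation. A junction that is farther than the threshold from every known site is returned unchanged, so a genuinely novel junction is not forced onto an annotated one.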

Indel and mismatch correction
Noncanonical splice junction correction
Results
