TranscriptClean: variant-aware correction of indels, mismatches and splice junctions in long-read transcripts

Dana Wyman, Ali Mortazavi

doi:10.1093/bioinformatics/bty483

Open Access

Journal: Bioinformatics
Publication Date: Jun 15, 2018
Citations: 59
License type: CC BY 4.0

Affiliation: University of California, Irvine

Abstract

Motivation: Long-read, single-molecule sequencing platforms hold great potential for isoform discovery and characterization of multi-exon transcripts. However, their high error rates are an obstacle to distinguishing novel transcript isoforms from sequencing artifacts. Therefore, we developed the package TranscriptClean to correct mismatches, microindels and noncanonical splice junctions in mapped transcripts using the reference genome while preserving known variants.

Results: Our method corrects nearly all mismatches and indels present in a publicly available human PacBio Iso-seq dataset, and rescues 39% of noncanonical splice junctions.

Availability and implementation: All Python and R scripts used in this paper are available at https://github.com/dewyman/TranscriptClean.

Highlights

Conventional short-read RNA sequencing is widely used to quantify gene expression in a variety of applications
TAPIS and SQANTI deal with remaining errors by removing affected transcripts, the former using a splice junction quality filter, and the latter using a random forest classifier. While these methods produce cleaner Pacific Biosciences (PacBio) datasets, none of them attempt to correct noncanonical splice junctions arising from microindel errors
We present TranscriptClean, a program that uses the reference genome, splice annotation and a variant file to correct mismatches, microindels and noncanonical splice junctions in PacBio transcripts while preserving known variants
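The idea of correcting mismatches against the reference while preserving known variants can be illustrated with a minimal sketch. This is not the TranscriptClean implementation: the function name, its arguments and the representation of variants as (position, base) pairs are assumptions made for illustration, and the sketch handles only gap-free alignments, ignoring indels.

```python
def correct_mismatches(read_seq, ref_seq, start, known_variants):
    """Replace mismatched read bases with the reference base, unless the
    mismatch corresponds to a known variant.

    Hypothetical inputs (assumptions, not the tool's API):
      read_seq, ref_seq: equal-length, gap-free aligned strings
      start: reference coordinate of the first aligned base
      known_variants: set of (reference_position, alternate_base) tuples
    """
    corrected = []
    for offset, (read_base, ref_base) in enumerate(zip(read_seq, ref_seq)):
        is_mismatch = read_base != ref_base
        is_known = (start + offset, read_base) in known_variants
        if is_mismatch and not is_known:
            # Likely sequencing error: revert to the reference base.
            corrected.append(ref_base)
        else:
            # Match, or a mismatch supported by the variant set: keep it.
            corrected.append(read_base)
    return "".join(corrected)


variants = {(12, "G")}
print(correct_mismatches("ACGTT", "ACCTA", 10, variants))
```

Here the mismatch at position 12 is kept because it matches a known variant, while the unsupported mismatch at position 14 is reverted to the reference base.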

Summary

Introduction

Conventional short-read RNA sequencing is widely used to quantify gene expression in a variety of applications. Circular consensus correction and read polishing steps in the PacBio ToFU analysis pipeline can substantially reduce the error rate for most transcripts once raw reads are processed (Eid, 2009; Gordon, 2015). However, this correction process is only effective when multiple sequencing passes over the same insert molecule are available, which becomes less likely as transcript length increases (Rhoads and Au, 2015). TAPIS and SQANTI deal with remaining errors by removing affected transcripts, the former using a splice junction quality filter and the latter using a random forest classifier. While these methods produce cleaner PacBio datasets, none of them attempt to correct noncanonical splice junctions arising from microindel errors. Each noncanonical splice junction (NCSJ) is compared against a set of high-confidence splice junctions (derived from same-sample mapped short RNA-seq reads or a reference annotation) and is changed to match the known junction when the distance between the NCSJ and its nearest high-confidence junction is microindel-sized.
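The rule for noncanonical splice junctions (NCSJs) described above, which snaps a junction to its nearest high-confidence junction when the distance is microindel-sized, can be sketched as follows. This is a simplified illustration rather than the authors' code: it treats a single junction boundary as one coordinate, the function names are hypothetical, and the `max_dist` threshold of 5 bp is an assumed value standing in for "microindel-sized".

```python
from bisect import bisect_left


def nearest_junction(pos, known_sorted):
    """Return the known junction coordinate closest to pos, or None
    if the list of known junctions is empty."""
    if not known_sorted:
        return None
    i = bisect_left(known_sorted, pos)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = known_sorted[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda k: abs(k - pos))


def correct_ncsj(pos, known_sorted, max_dist=5):
    """Snap a noncanonical junction boundary to the nearest high-confidence
    junction if it lies within max_dist bases; otherwise leave it unchanged.

    max_dist is an illustrative assumption, not a value taken from the text.
    """
    nearest = nearest_junction(pos, known_sorted)
    if nearest is not None and abs(nearest - pos) <= max_dist:
        return nearest
    return pos


known = [100, 250, 400]  # sorted high-confidence junction coordinates
print(correct_ncsj(103, known))  # within 5 bp of 100, so it is snapped
print(correct_ncsj(120, known))  # too far from any known junction, unchanged
```

Keeping the known junctions sorted allows a binary search per query, which matters when many junctions must be checked against a large annotation.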

Indel and mismatch correction

Noncanonical splice junction correction

Results