Transcriptome assembly from long-read RNA-seq alignments with StringTie2

Sam Kovaka, Steven L Salzberg, Geo M Pertea, Roham Razaghi, Aleksey V Zimin, Mihaela Pertea

doi:10.1186/s13059-019-1910-1

Abstract

RNA sequencing using the latest single-molecule sequencing instruments produces reads that are thousands of nucleotides long. The ability to assemble these long reads can greatly improve the sensitivity of long-read analyses. Here we present StringTie2, a reference-guided transcriptome assembler that works with both short and long reads. StringTie2 includes new methods to handle the high error rate of long reads and offers the ability to work with full-length super-reads assembled from short reads, which further improves the quality of short-read assemblies. StringTie2 is more accurate and faster and uses less memory than all comparable short-read and long-read analysis tools.

Highlights

Measuring the abundances of transcripts in an RNA-sequencing (RNA-seq) dataset is a powerful way to understand the workings of a cell
Transcriptome assembly of short RNA-seq reads: We first used simulated human data to compare the sensitivity and precision of StringTie2, with and without super-reads, to that of Scallop (Fig. 1), one of the most recent transcriptome assemblers for short RNA-seq data, which was shown on some data to yield an improvement in assembly accuracy over StringTie1 [11]
StringTie2, with default parameters, is both more sensitive and more precise than Scallop on this data, and the use of super-reads increases both the sensitivity and precision of StringTie2 compared to using short-read alignments alone
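Sensitivity and precision, as used in the highlights above, can be illustrated with a small sketch. It assumes a common convention for scoring assemblies: a multi-exon assembled transcript counts as correct when its intron chain (the ordered list of splice junctions) exactly matches that of a reference transcript on the same chromosome and strand. The function names, the tuple layout, and the toy coordinates are hypothetical and are not taken from the paper or from any evaluation tool; the paper's own evaluation procedure may differ in detail.

```python
from typing import List, Tuple

Exon = Tuple[int, int]                    # (start, end) of one exon
Transcript = Tuple[str, str, List[Exon]]  # (chromosome, strand, exons sorted by start)


def intron_chain_key(tx: Transcript):
    """Build a hashable key from the chromosome, strand, and intron chain.

    Each intron is (end of exon i, start of exon i+1), so transcript
    start and end coordinates do not affect the key.
    """
    chrom, strand, exons = tx
    introns = tuple((exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1))
    return (chrom, strand, introns)


def sensitivity_precision(reference: List[Transcript], assembled: List[Transcript]):
    """Return (sensitivity, precision) under intron-chain matching.

    sensitivity = matched reference transcripts / all reference transcripts
    precision   = matched assembled transcripts / all assembled transcripts
    Assumption: all transcripts are multi-exon and have distinct intron chains.
    """
    ref_keys = {intron_chain_key(t) for t in reference}
    asm_keys = {intron_chain_key(t) for t in assembled}
    matched = ref_keys & asm_keys
    sensitivity = len(matched) / len(ref_keys) if ref_keys else 0.0
    precision = len(matched) / len(asm_keys) if asm_keys else 0.0
    return sensitivity, precision


# Toy data (hypothetical coordinates).
reference = [
    ("chr1", "+", [(100, 200), (300, 400), (500, 600)]),
    ("chr1", "+", [(100, 200), (500, 600)]),
    ("chr2", "-", [(50, 150), (250, 350)]),
    ("chr2", "-", [(50, 150), (400, 500)]),
]
assembled = [
    ("chr1", "+", [(90, 200), (300, 400), (500, 620)]),  # same introns, different ends
    ("chr2", "-", [(50, 150), (250, 350)]),              # exact match
    ("chr2", "-", [(50, 150), (260, 350)]),              # shifted splice site, no match
]

sens, prec = sensitivity_precision(reference, assembled)
print(f"sensitivity={sens:.2f} precision={prec:.2f}")
```

In this toy case two of the four reference transcripts are recovered and two of the three assembled transcripts are correct, so an assembler can only improve both numbers at once by adding correct transcripts without adding spurious ones.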

Summary

Introduction

Measuring the abundances of transcripts in an RNA-sequencing (RNA-seq) dataset is a powerful way to understand the workings of a cell. Aligning reads to a reference genome can provide rough estimates of the average expression of genes and hint at differential use of splice sites [1], but to create an accurate picture of gene activity, one must assemble collections of reads into transcripts. Alternative splicing is very common in eukaryotes, with an estimated 90% of human multi-exon protein-coding genes and 30% of non-coding RNA (ncRNA) genes having multiple isoforms [2, 3]. Second-generation sequencers, such as those from Illumina, can produce hundreds of millions of short (~100 bp) RNA-seq reads. Reads of this length usually span no more than two exons, except in cases of very small exons. By assembling the short reads, we can [...] achieve higher sensitivity and precision than StringTie (version 1.3) and TransComb

Kovaka et al. Genome Biology (2019) 20:278

Methods

Results

Conclusion