Abstract

Motivation: High-throughput sequencing of mRNA (RNA-Seq) has led to tremendous improvements in the detection of expressed genes and reconstruction of RNA transcripts. However, the extensive dynamic range of gene expression, technical limitations and biases, as well as the observed complexity of the transcriptional landscape, pose profound computational challenges for transcriptome reconstruction.

Results: We present the novel framework MITIE (Mixed Integer Transcript IdEntification) for simultaneous transcript reconstruction and quantification. We define a likelihood function based on the negative binomial distribution, use a regularization approach to select a few transcripts collectively explaining the observed read data and show how to find the optimal solution using Mixed Integer Programming. MITIE can (i) take advantage of known transcripts, (ii) reconstruct and quantify transcripts simultaneously in multiple samples, and (iii) resolve the location of multi-mapping reads. It is designed for genome- and assembly-based transcriptome reconstruction. We present an extensive study based on realistic simulated RNA-Seq data. When compared with state-of-the-art approaches, MITIE proves to be significantly more sensitive and overall more accurate. Moreover, MITIE yields substantial performance gains when used with multiple samples. We applied our system to 38 Drosophila melanogaster modENCODE RNA-Seq libraries and estimated the sensitivity of reconstructing omitted transcript annotations and the specificity with respect to annotated transcripts. Our results corroborate that a well-motivated objective paired with appropriate optimization techniques leads to significant improvements over the state-of-the-art in transcriptome reconstruction.

Availability: MITIE is implemented in C++ and is available from http://bioweb.me/mitie under the GPL license.

Contact: Jonas_Behr@web.de and raetsch@cbio.mskcc.org

Supplementary information: Supplementary data are available at Bioinformatics online.

Highlights

  • The optimization problem formalized by MITIE generalizes to solve transcript prediction in the de novo setting, and we show in Section 4 that the MITIE strategy is superior to the dynamic programming-based strategy of Trinity

  • MITIE can build a segment graph based on given alignments of RNA-Seq reads to a genome or start with segment graphs obtained by other means, in particular by de novo assembly


Summary

Introduction

Most of the complexity of higher eukaryotic transcriptomes can be attributed to the encoding of multiple transcripts at a single genic locus by means of alternative splicing, transcription start and termination (e.g. Nilsen and Graveley, 2010; Rätsch et al., 2007; Schweikert et al., 2009). Alignment tools for RNA-Seq reads, such as PALMapper (De Bona et al., 2008; Jean et al., 2010), TopHat (Trapnell et al., 2009), MapSplice (Wang et al., 2010), Star (Dobin et al., 2012) or Gsnap (Wu and Nacu, 2010), are typically able to identify new exon–exon junctions, which are candidates for introns. This information can be compiled into a segment or splicing graph, a directed acyclic graph, where the nodes correspond to exonic segments and the edges correspond to intron candidates (cf. Fig. 1 for an illustration). We will focus on genome-based transcript reconstruction when describing the approach and discuss de novo assembly whenever necessary.
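To make the segment graph concrete, the following is a minimal Python sketch, not MITIE's actual implementation (which is written in C++). It assumes exonic segments are given as sorted, non-overlapping closed intervals and that each intron candidate is a (donor, acceptor) pair matching a segment end and a segment start. The function names `build_segment_graph` and `enumerate_paths` and the toy coordinates are hypothetical, chosen only for illustration.

```python
def build_segment_graph(segments, junctions):
    """Return adjacency sets of a directed acyclic segment graph.

    segments  : sorted, non-overlapping (start, end) closed intervals (assumed input format)
    junctions : (donor_end, acceptor_start) intron candidates (assumed input format)
    """
    start_to_node = {start: i for i, (start, _) in enumerate(segments)}
    end_to_node = {end: i for i, (_, end) in enumerate(segments)}
    edges = {i: set() for i in range(len(segments))}

    # One edge per intron candidate whose ends coincide with segment boundaries.
    for donor, acceptor in junctions:
        if donor < acceptor and donor in end_to_node and acceptor in start_to_node:
            edges[end_to_node[donor]].add(start_to_node[acceptor])

    # Directly adjacent segments are connected without an intron.
    for i in range(len(segments) - 1):
        if segments[i][1] + 1 == segments[i + 1][0]:
            edges[i].add(i + 1)
    return edges


def enumerate_paths(edges):
    """List every source-to-sink path; each path is one candidate transcript."""
    has_incoming = {target for targets in edges.values() for target in targets}
    sources = sorted(node for node in edges if node not in has_incoming)
    paths = []

    def extend(path):
        successors = sorted(edges[path[-1]])
        if not successors:  # sink reached: the path is complete
            paths.append(path)
            return
        for nxt in successors:
            extend(path + [nxt])

    for source in sources:
        extend([source])
    return paths


# Toy locus: three exonic segments, with the middle one optionally skipped.
segments = [(100, 200), (300, 400), (500, 600)]
junctions = [(200, 300), (400, 500), (200, 500)]
graph = build_segment_graph(segments, junctions)
candidates = enumerate_paths(graph)
print(candidates)  # [[0, 1, 2], [0, 2]]
```

In this toy locus the two paths correspond to an isoform that includes the middle segment and one that skips it. The number of paths can grow exponentially with the number of alternative events, so exhaustive enumeration as done here is only practical for small graphs; selecting a few paths that jointly explain the read data is the optimization problem the abstract describes.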
