Abstract

Motivation: High-throughput sequencing has transformed the study of gene expression levels through RNA-seq, a technique that is now routinely used by various fields, such as genetic research or diagnostics. The advent of third generation sequencing technologies providing significantly longer reads opens up new possibilities. However, the high error rates common to these technologies set new bioinformatics challenges for the gapped alignment of reads to their genomic origin. In this study, we have explored how currently available RNA-seq splice-aware alignment tools cope with increased read lengths and error rates. All tested tools were initially developed for short NGS reads, but some have claimed support for long Pacific Biosciences (PacBio) or even Oxford Nanopore Technologies (ONT) MinION reads.

Results: The tools were tested on synthetic and real datasets from two technologies (PacBio and ONT MinION). Alignment quality and resource usage were compared across different aligners. The effect of error correction of long reads was explored, both using self-correction and correction with an external short reads dataset. A tool was developed for evaluating RNA-seq alignment results. This tool can be used to compare the alignment of simulated reads to their genomic origin, or to compare the alignment of real reads to a set of annotated transcripts. Our tests show that while some RNA-seq aligners were unable to cope with long error-prone reads, others produced overall good results. We further show that alignment accuracy can be improved using error-corrected reads.

Availability and implementation: https://github.com/kkrizanovic/RNAseqEval, https://figshare.com/projects/RNAseq_benchmark/24391

Supplementary information: Supplementary data are available at Bioinformatics online.

Highlights

  • Over the past ten years, the use of next-generation sequencing (NGS) platforms, in particular Illumina, has expanded to dominate the genome and transcriptome sequencing market

  • We have explored how currently available RNA-seq splice-aware alignment tools cope with increased read lengths and error rates

  • While a Pacific Biosciences (PacBio) simulator is not entirely appropriate for Oxford Nanopore Technologies (ONT) MinION data, we felt that mimicking their read length and error profile should provide some useful insight


Summary

Introduction

Over the past ten years, the use of next-generation sequencing (NGS) platforms, in particular Illumina, has expanded to dominate the genome and transcriptome sequencing market. Their sequencing-by-synthesis approach is much cheaper and faster than the previously used Sanger sequencing. Two new sequencing technologies, the so-called “third generation sequencing technologies”, have emerged that produce longer reads and hold numerous promises for genomic and transcriptomic studies. In PacBio sequencing, circularized fragments are sequenced multiple times to reduce error rates, and the resulting subreads can be reconciled into higher-quality consensus “Reads of Insert” (ROIs, previously called Circular Consensus Reads). There is a trade-off between ROI length and accuracy, because longer fragments accumulate fewer sequencing passes.
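
The trade-off between insert length and consensus accuracy can be illustrated with a toy calculation. The sketch below is not taken from the study: it assumes a fixed polymerase read length, independent errors in each pass, and a simple per-base majority vote, none of which reflects how real consensus callers work. The function names and the numbers used (a 15,000-base polymerase read and a 15% per-pass error rate) are hypothetical and chosen only for illustration.

```python
from math import comb


def full_passes(polymerase_read_len: int, insert_len: int) -> int:
    """Number of complete passes over a circularized insert.

    Assumption: the polymerase reads a fixed number of bases in total,
    so a longer insert is traversed fewer times.
    """
    if insert_len <= 0:
        raise ValueError("insert_len must be positive")
    return polymerase_read_len // insert_len


def majority_vote_error(per_pass_error: float, passes: int) -> float:
    """Toy per-base consensus error under a strict majority vote.

    Assumption: each pass is wrong at a given base independently with
    probability per_pass_error, and the consensus is wrong only when
    strictly more than half of the passes are wrong (ties count as correct).
    """
    if passes < 1:
        raise ValueError("at least one pass is required")
    p = per_pass_error
    threshold = passes // 2 + 1  # smallest count that is a strict majority
    return sum(
        comb(passes, k) * p**k * (1 - p) ** (passes - k)
        for k in range(threshold, passes + 1)
    )


# Illustrative values only (hypothetical, not measured data).
POLYMERASE_READ_LEN = 15_000
PER_PASS_ERROR = 0.15

for insert_len in (1_000, 3_000, 5_000):
    n = full_passes(POLYMERASE_READ_LEN, insert_len)
    err = majority_vote_error(PER_PASS_ERROR, n)
    print(f"insert={insert_len} passes={n} toy consensus error={err:.6f}")
```

Under these assumptions, shorter inserts receive more passes and the modeled consensus error drops accordingly, which is the qualitative behavior described above.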

