FMLRC: Hybrid long read error correction using an FM-index

Jeremy R Wang, James Holt, Corbin D Jones, Leonard McMillan

doi:10.1186/s12859-018-2051-3

Journal: BMC Bioinformatics
Publication Date: Feb 9, 2018
Citations: 121
License type: open-access

Affiliation: University of North Carolina at Chapel Hill

Abstract

Background: Long read sequencing is changing the landscape of genomic research, especially de novo assembly. Despite the high error rate inherent to long read technologies, increased read lengths dramatically improve the continuity and accuracy of genome assemblies. However, the cost and throughput of these technologies limit their application to complex genomes. One solution is to decrease the cost and time to assemble novel genomes by leveraging “hybrid” assemblies that use long reads for scaffolding and short reads for accuracy.

Results: We describe a novel method that leverages a multi-string Burrows-Wheeler Transform with an auxiliary FM-index to correct errors in long read sequences using a set of complementary short reads. We demonstrate that our method efficiently produces significantly more high-quality corrected sequence than existing hybrid error-correction methods. We also show that our method produces more contiguous assemblies, in many cases, than existing state-of-the-art hybrid and long-read-only de novo assembly methods.

Conclusion: Our method accurately corrects long read sequence data using complementary short reads. We demonstrate higher total throughput of corrected long reads and a corresponding increase in contiguity of the resulting de novo assemblies. Improved throughput and computational efficiency relative to existing methods will help make more economical use of emerging long read sequencing technologies.
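To make the central data structure concrete, the following is a minimal sketch of how a Burrows-Wheeler Transform over a collection of short reads, together with FM-index tables, can report how often a k-mer occurs in those reads. It is an illustration under stated assumptions and not the authors' implementation: the function names `build_fm_index` and `count_kmer` are hypothetical, the index is built naively by sorting suffixes of the concatenated reads, and nothing is compressed or sampled as a practical tool would require. A k-mer from a long read that is rare or absent in the short reads is a plausible candidate for containing a sequencing error, which is the kind of query a hybrid corrector depends on.

```python
def build_fm_index(reads):
    """Build a toy FM-index over a read collection (illustrative only)."""
    # Terminate every read with '$' so no match can span two reads.
    text = "".join(read + "$" for read in reads)
    # Naive suffix array: fine for a demo, far too slow for real data.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT character = the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in suffix_array)
    alphabet = sorted(set(text))
    # first[c]: number of characters in the text strictly smaller than c.
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return first, occ, len(bwt)


def count_kmer(index, kmer):
    """Count occurrences of kmer in the indexed reads by backward search."""
    first, occ, n = index
    lo, hi = 0, n  # half-open range of suffixes matching the current suffix of kmer
    for ch in reversed(kmer):
        if ch not in first:
            return 0
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


short_reads = ["ACGTACGT", "CGTAC", "TTACG"]
index = build_fm_index(short_reads)
print(count_kmer(index, "ACG"))  # 3
print(count_kmer(index, "GGG"))  # 0
```

Each query costs time proportional to the k-mer length, independent of how many reads are indexed, which is what makes this style of lookup attractive when every position of every long read must be checked.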

Highlights

Long read sequencing is changing the landscape of genomic research, especially de novo assembly
We evaluated the accuracy of our method using complementary long- and short-read datasets for three species: E. coli K12, S. cerevisiae W303, and A. thaliana Ler-0
We assessed the effectiveness of our corrected reads for de novo assembly using a non-correcting assembler, Miniasm [6], and compared these data to several other state-of-the-art hybrid and long-read-only de novo assembly methods

Summary

Introduction

Long read sequencing is changing the landscape of genomic research, especially de novo assembly. Despite the high error rate inherent to long read technologies, increased read lengths dramatically improve the continuity and accuracy of genome assemblies. De novo genome assembly has benefitted dramatically from the introduction of so-called “long” read sequencing technologies. These technologies, such as SMRT sequencing by Pacific Biosciences (PacBio) and nanopore sequencing platforms by Oxford Nanopore Technologies, produce reads that are typically tens of kilobases long instead of hundreds of bases. For genomes of more complex eukaryotes and mammals, the computational resources required for effective de novo assembly are staggering and difficult to coordinate. This is driven largely by the pairwise overlap step required by all modern long read assemblers. While novel methods such as MHAP [5] and Minimap [6] aim to improve this, in practice the computational time and memory required are often prohibitive.

Methods

Results

Discussion

Conclusion