A hybrid and scalable error correction algorithm for indel and substitution errors of long reads

Arghya Kusum Das, Sayan Goswami, Kisung Lee, Seung-Jong Park

doi:10.1186/s12864-019-6286-9

Open Access

https://doi.org/10.1186/s12864-019-6286-9

Abstract

Background: Long-read sequencing promises to overcome the short-length limitations of second-generation sequencing by providing more complete assemblies. However, computational analysis of long reads is challenged by their higher error rates (e.g., 13% vs. 1%) and higher cost ($0.3 vs. $0.03 per Mbp) compared to short reads.

Methods: In this paper, we present a new hybrid error correction tool called ParLECH (Parallel Long-read Error Correction using Hybrid methodology). The error correction algorithm of ParLECH is distributed in nature and efficiently uses the k-mer coverage information of high-throughput Illumina short-read sequences to rectify PacBio long-read sequences. ParLECH first constructs a de Bruijn graph from the short reads, and then replaces the indel error regions of the long reads with their corresponding widest path (or maximum min-coverage path) in the short-read-based de Bruijn graph. ParLECH then uses the k-mer coverage information of the short reads to divide each long read into a sequence of low- and high-coverage regions, followed by majority voting to rectify each substituted error base.

Results: ParLECH outperforms the latest state-of-the-art hybrid error correction methods on real PacBio datasets. Our experimental evaluation demonstrates that ParLECH can correct large-scale real-world datasets in an accurate and scalable manner. ParLECH can correct the indel errors of human genome PacBio long reads (312 GB) with Illumina short reads (452 GB) in less than 29 h using 128 compute nodes. ParLECH can align more than 92% of the bases of an E. coli PacBio dataset with the reference genome, demonstrating its accuracy.

Conclusion: ParLECH can scale to terabytes of sequencing data using hundreds of computing nodes. The proposed hybrid error correction methodology is novel and rectifies both indel and substitution errors, whether present in the original long reads or newly introduced by the short reads.
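The widest-path step described in the abstract can be illustrated with a small, single-machine sketch. This is not ParLECH's distributed implementation. It assumes, for illustration only, that each de Bruijn graph vertex is a k-mer weighted by its short-read coverage, that two vertices are connected when they overlap by k-1 bases, and that the two k-mers flanking an indel region are already known. The function names `kmer_counts` and `widest_path` are hypothetical. The search is a Dijkstra-style traversal that maximizes the minimum coverage along the path instead of minimizing a sum of weights.

```python
from collections import Counter
import heapq


def kmer_counts(reads, k):
    """Count every k-mer in the short reads; the count is its coverage."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def widest_path(counts, source, target):
    """Find the path from source to target k-mer with the largest minimum coverage.

    Returns (spelled_sequence, bottleneck_coverage), or None if the target
    cannot be reached. Vertices are k-mers; an edge joins two k-mers that
    overlap by k-1 bases (an assumption of this sketch).
    """
    if source not in counts or target not in counts:
        return None
    best = {source: counts[source]}  # best bottleneck found so far per vertex
    parent = {source: None}
    heap = [(-counts[source], source)]  # max-heap via negated bottleneck
    while heap:
        neg_width, node = heapq.heappop(heap)
        width = -neg_width
        if width < best[node]:
            continue  # stale heap entry
        if node == target:
            break
        for base in "ACGT":
            nxt = node[1:] + base
            if nxt not in counts:
                continue
            new_width = min(width, counts[nxt])
            if new_width > best.get(nxt, 0):
                best[nxt] = new_width
                parent[nxt] = node
                heapq.heappush(heap, (-new_width, nxt))
    if target not in best:
        return None
    # Walk back from the target and spell out the corrected sequence.
    path = []
    node = target
    while node is not None:
        path.append(node)
        node = parent[node]
    path.reverse()
    sequence = path[0] + "".join(kmer[-1] for kmer in path[1:])
    return sequence, best[target]


# Toy data: three reads support one branch, a single read supports the other.
short_reads = ["ACGTTGA"] * 3 + ["ACGATGA"]
counts = kmer_counts(short_reads, 3)
print(widest_path(counts, "ACG", "TGA"))  # ('ACGTTGA', 3)
```

In this toy case both branches connect the same flanking k-mers, but the branch seen in three reads has a bottleneck coverage of 3 versus 1 for the alternative, so it is the one selected to replace the region.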

Highlights

Long-read sequencing promises to overcome the short-length limitations of second-generation sequencing by providing more complete assemblies
We demonstrate the scalability of ParLECH (Parallel Long-read Error Correction using Hybrid methodology) by correcting a 312 GB human genome PacBio dataset, leveraging a 452 GB Illumina dataset (64x coverage), on 128 nodes in less than 29 h
We evaluate ParLECH on four real datasets: E. coli, yeast, fruit fly, and human genome

Summary

Introduction

Long-read sequencing promises to overcome the short-length limitations of second-generation sequencing by providing more complete assemblies. Second-generation sequencing technologies (e.g., Illumina, Ion Torrent) have provided researchers with the required throughput at significantly low cost ($0.03/million-bases), which enabled the discovery of many new species and variants. Although they are widely used for understanding complex phenotypes, they are typically incapable of resolving long repetitive elements, common in various genomes (e.g., eukaryotic genomes), because of their short read lengths [1]. The production cost of long reads is almost 10 times that of short reads, and the analysis of long reads is severely constrained by their higher error rate.

Journal: BMC Genomics
Publication Date: Dec 1, 2019
Citations: 10
License type: open-access
