HECIL: A Hybrid Error Correction Algorithm for Long Reads with Iterative Learning

Olivia Choudhury,Scott J Emrich,Ankush Chakrabarty

doi:10.1038/s41598-018-28364-3

Open Access

Abstract

Second-generation DNA sequencing techniques generate short reads that can result in fragmented genome assemblies. Third-generation sequencing platforms mitigate this limitation by producing longer reads that span across complex and repetitive regions. However, the usefulness of such long reads is limited because of high sequencing error rates. To exploit the full potential of these longer reads, it is imperative to correct the underlying errors. We propose HECIL—Hybrid Error Correction with Iterative Learning—a hybrid error correction framework that determines a correction policy for erroneous long reads, based on optimal combinations of decision weights obtained from short read alignments. We demonstrate that HECIL outperforms state-of-the-art error correction algorithms for an overwhelming majority of evaluation metrics on diverse, real-world data sets including E. coli, S. cerevisiae, and the malaria vector mosquito A. funestus. Additionally, we provide an optional avenue of improving the performance of HECIL’s core algorithm by introducing an iterative learning paradigm that enhances the correction policy at each iteration by incorporating knowledge gathered from previous iterations via data-driven confidence metrics assigned to prior corrections.

Highlights

Various correction algorithms have been proposed for reducing the currently high error rates prevalent in long reads
We filter long reads of E. coli to exclude reads shorter than 100 bp, creating a final set of 33,360 reads
Due to the high computational effort required by proovread and CoLoRMap to correct the reads of all flowcells, we present a comparative analysis based on a representative selection of three flowcells: 1, 4, and 16
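
The length filter mentioned in the highlights (excluding long reads shorter than 100 bp) can be illustrated with a minimal sketch. This is not HECIL's own preprocessing code: the FASTA input format, the helper names `parse_fasta` and `filter_by_length`, and the toy reads are assumptions made for illustration only.

```python
from typing import Iterable, Iterator, List, Tuple

MIN_LENGTH = 100  # bp threshold quoted in the highlights


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(
    records: Iterable[Tuple[str, str]], min_length: int = MIN_LENGTH
) -> List[Tuple[str, str]]:
    """Keep only reads whose sequence has at least min_length bases."""
    return [(h, s) for h, s in records if len(s) >= min_length]


# Toy input (hypothetical reads): lengths 120, 40 and 100 bases.
example = [">read_a", "ACGT" * 30, ">read_b", "ACGT" * 10, ">read_c", "A" * 100]
kept = filter_by_length(parse_fasta(example))
print([header for header, _ in kept])
```

Reads of exactly 100 bp are retained here, since the highlight describes excluding only reads shorter than the threshold.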

Summary

Introduction

Various correction algorithms have been proposed to reduce the high error rates currently prevalent in long reads. HGAP [13] is a self-correcting algorithm (that is, it does not rely on additional sequencing data) that performs correction by computing multiple alignments of high-coverage long reads. Another class of correction algorithms relies on short reads generated from the same (or related) samples; these are referred to as hybrid correction algorithms. HECIL's iterative procedure further improves the quality of error correction, in terms of both alignment-based and assembly-based metrics, by incorporating knowledge derived from high-confidence corrections made in prior iterations. We speculate that the proposed iterative learning formalism can be incorporated into other contemporary hybrid error correction algorithms to improve performance, at the expense of total execution time.

Methods

Results

Conclusion