HALC: High throughput algorithm for long read error correction

Ergude Bao, Lingxiao Lan

doi:10.1186/s12859-017-1610-3

Abstract

Background: The third generation PacBio SMRT long reads can effectively address the read length issue of the second generation sequencing technology, but contain approximately 15% sequencing errors. Several error correction algorithms have been designed to efficiently reduce the error rate to 1%, but they discard large amounts of uncorrected bases and thus lead to low throughput. This loss of bases could limit the completeness of downstream assemblies and the accuracy of analysis.

Results: Here, we introduce HALC, a high throughput algorithm for long read error correction. HALC aligns the long reads to short read contigs from the same species with a relatively low identity requirement, so that a long read region can be aligned to at least one contig region, including repeats of its true genome region in the contigs that are sufficiently similar to it (similar repeat based alignment approach). It then constructs a contig graph and, for each long read, references the other long reads' alignments to find the most accurate alignment and correct it with the aligned contig regions (long read support based validation approach). Even though some long read regions without the true genome regions in the contigs are corrected with their repeats, this approach makes it possible to further refine these long read regions with the initial insufficient short reads and correct the uncorrected regions in between. In our performance tests on E. coli, A. thaliana and Maylandia zebra data sets, HALC was able to obtain 6.7-41.1% higher throughput than the existing algorithms while maintaining comparable accuracy. The HALC corrected long reads can thus result in 11.4-60.7% longer assembled contigs than the existing algorithms.

Conclusions: The HALC software can be downloaded for free from this site: https://github.com/lanl001/halc.
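The long read support based validation idea can be illustrated with a toy sketch. This is not the HALC implementation: the data layout (each candidate alignment as a path of contig IDs plus an identity value), the function names `edge_support` and `pick_alignment`, and the scoring rule are all assumptions made for illustration. The sketch only shows how alignments of other long reads could be used to prefer one candidate alignment over another.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical, simplified input: a candidate alignment of one long read is
# the ordered contig IDs it spans plus the alignment identity.
Candidate = Tuple[Tuple[str, ...], float]


def edge_support(candidates: Dict[str, List[Candidate]]) -> Counter:
    """Count, for each contig-to-contig edge, how many long reads have at
    least one candidate alignment that traverses it."""
    support: Counter = Counter()
    for cands in candidates.values():
        edges = set()
        for path, _identity in cands:
            edges.update(zip(path, path[1:]))
        support.update(edges)  # each read counts once per edge
    return support


def pick_alignment(read_id: str,
                   candidates: Dict[str, List[Candidate]],
                   support: Counter) -> Candidate:
    """Pick the candidate whose edges are supported by the most other long
    reads; alignment identity is used only to break ties (assumed rule)."""
    def score(cand: Candidate) -> Tuple[int, float]:
        path, identity = cand
        # Subtract 1 so a read does not count as support for itself.
        others = sum(support[e] - 1 for e in zip(path, path[1:]))
        return (others, identity)

    return max(candidates[read_id], key=score)


# Toy data: read r1 aligns slightly better to a repeat (c1 -> c3), but the
# other reads support the path c1 -> c2.
candidates = {
    "r1": [(("c1", "c2"), 0.80), (("c1", "c3"), 0.82)],
    "r2": [(("c1", "c2"), 0.85)],
    "r3": [(("c1", "c2"), 0.83)],
}
support = edge_support(candidates)
best_path, best_identity = pick_alignment("r1", candidates, support)
print(best_path, best_identity)
```

In this toy case the path through `c1` and `c2` is chosen for `r1` even though its identity is lower, because two other reads traverse the same edge. A real corrector would also need to handle partial overlaps, orientation, and alignment coordinates, none of which are modeled here.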

Highlights

The third generation PacBio SMRT long reads can effectively address the read length issue of the second generation sequencing technology, but contain approximately 15% sequencing errors.
Experimental design: To evaluate the performance of HALC, we ran it on three data sets from the species E. coli, A. thaliana and Maylandia zebra, which have small, medium and large genome sizes, respectively.
Except for PacBioToCA and LSC, the average read length of all the algorithms is inversely proportional to the throughput, because more but shorter reads can be obtained with higher throughput.
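The relationship between throughput and average read length in the last highlight can be made concrete with a small calculation. The sketch below assumes that throughput is measured as the total number of bases in the corrected output and that a corrector may emit several corrected pieces per raw read; the function name `correction_stats` and the example numbers are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def correction_stats(raw_lengths: List[int],
                     corrected_lengths: List[int]) -> Dict[str, float]:
    """Summarize a corrector's output from read lengths alone.

    Assumed definitions (not confirmed by the source):
      throughput_bases    = total bases in the corrected output
      fraction_retained   = throughput_bases / total raw bases
      average_read_length = throughput_bases / number of corrected reads
    """
    raw_bases = sum(raw_lengths)
    out_bases = sum(corrected_lengths)
    n_out = len(corrected_lengths)
    return {
        "throughput_bases": out_bases,
        "fraction_retained": out_bases / raw_bases if raw_bases else 0.0,
        "average_read_length": out_bases / n_out if n_out else 0.0,
    }


# One raw read of 1000 bases, corrected by two hypothetical tools.
# Tool A keeps only its single longest corrected piece.
stats_a = correction_stats([1000], [600])
# Tool B also keeps shorter corrected pieces from the same read.
stats_b = correction_stats([1000], [400, 300, 200])

print(stats_a)
print(stats_b)
```

Under these assumptions, tool B retains more bases (900 versus 600) but reports a shorter average read length (300 versus 600), which is the pattern the highlight describes.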

Summary

Introduction

Several error correction algorithms have been designed to efficiently reduce the error rate to 1%, but they discard large amounts of uncorrected bases and lead to low throughput. This loss of bases could limit the completeness of downstream assemblies and the accuracy of analysis. A tremendous number of species have been assembled from short reads, but most of the assemblies are incomplete and fragmented into several thousand contigs [2, 3]. To address this issue, the PacBio SMRT sequencing technology, as a representative of third generation sequencing technology, has been attracting more and more attention since its commercial release in 2010 [4]. Depending on how the long reads are used, sequencing projects can be grouped into two classes.

Journal: BMC Bioinformatics
Publication Date: Apr 5, 2017
Citations: 59
License type: open-access

Similar Papers

Athena: Automated Tuning of k-mer based Genomic Error Correction Algorithms using Language Models
Mustafa Abdallah ... Somali Chaterji
Scientific Reports | VOL. 9 | 06 Nov 2019

Cerulean: A Hybrid Assembly Using High Throughput Short and Long Reads
Viraj Deshpande ... Son Pham
01 Jan 2013

New decoding techniques for modified product code used in critical applications
David C.C Freitas ... João C.M Mota
Microelectronics Reliability | VOL. 128 | 13 Dec 2021

An Error Correction Algorithm for NGS Data
Mehdi Kchouk ... Mourad Elloumi
01 Aug 2017
