An efficient error correction algorithm using FM-index

Yao-Ting Huang, Yu-Wen Huang

doi:10.1186/s12859-017-1940-1

Abstract

Background: High-throughput sequencing offers higher throughput and lower cost for sequencing a genome. However, sequencing errors, including mismatches and indels, may be produced during sequencing. Because errors may reduce the accuracy of subsequent de novo assembly, error correction is necessary prior to assembly. However, existing correction methods still face trade-offs among correction power, accuracy, and speed.

Results: We develop a novel overlap-based error correction algorithm using FM-index (called FMOE). FMOE first identifies overlapping reads by aligning a query read simultaneously against multiple reads compressed by FM-index. Subsequently, sequencing errors are corrected by k-mer voting from overlapping reads only. The experimental results indicate that FMOE has the highest correction power with comparable accuracy and speed. Compared with other methods, our algorithm performs better on long-read than on short-read datasets. The assembly results indicate that each algorithm has its own strengths and weaknesses, whereas FMOE works well for long or good-quality reads.

Conclusions: FMOE is freely available at https://github.com/ythuang0522/FMOC.
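The abstract describes aligning a query read against reads compressed by an FM-index. As a rough illustration of the underlying data structure only, the following toy sketch builds an FM-index over a single string and counts exact occurrences of a pattern by backward search. It is not FMOE's implementation: the function names `build_fm_index` and `count_occurrences` are hypothetical, the suffix array is built naively, and a real tool would index many reads with a compressed representation and allow inexact matches.

```python
def build_fm_index(text):
    """Build a toy FM-index (C table and Occ table) for text plus a sentinel."""
    text += "$"  # sentinel, sorts before A, C, G, T
    # Naive suffix array: fine for a demo, far too slow for real read sets.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in suffix_array)
    alphabet = sorted(set(text))
    # c_table[ch]: number of characters in text strictly smaller than ch.
    c_table, total = {}, 0
    for ch in alphabet:
        c_table[ch] = total
        total += text.count(ch)
    # occ[ch][i]: number of occurrences of ch in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in alphabet}
    for i, b in enumerate(bwt):
        for ch in alphabet:
            occ[ch][i + 1] = occ[ch][i] + (1 if b == ch else 0)
    return c_table, occ, len(bwt)


def count_occurrences(pattern, index):
    """Count exact occurrences of pattern using backward search."""
    c_table, occ, n = index
    lo, hi = 0, n  # half-open interval of matching suffix-array rows
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = build_fm_index("ACGTACGTAC")
print(count_occurrences("ACG", index))
```

Backward search extends the match one character at a time from the end of the pattern, so the cost depends on the pattern length rather than on the size of the indexed text, which is what makes querying a large compressed read set practical.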

Highlights

High-throughput sequencing offers higher throughput and lower cost for sequencing a genome
This paper presents a novel overlap-based error correction algorithm using FM-index
Given a query read, we first identify reads overlapping with the query by performing alignment against reads compressed in FM-index, construct a multiple-sequence alignment (MSA) matrix, and replace the less-frequent alleles on the query with the most-frequent one at the same locus
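The replacement step in the last highlight can be sketched as a column-wise majority vote. This is a minimal sketch under strong assumptions, not the authors' method: overlapping reads are assumed to be already found and given as hypothetical `(offset, read)` pairs, where `read[0]` aligns to `query[offset]`; only substitutions are handled (no indels); and the query base is kept on ties. The name `correct_by_majority` is illustrative.

```python
from collections import Counter


def correct_by_majority(query, overlaps):
    """Replace each query base with the strictly most frequent base in its column.

    overlaps: list of (offset, read) pairs; read[0] aligns to query[offset],
    and offset may be negative when the read starts before the query.
    """
    corrected = []
    for pos, base in enumerate(query):
        column = Counter([base])  # the query votes for its own base
        for offset, read in overlaps:
            j = pos - offset  # index into the overlapping read
            if 0 <= j < len(read):
                column[read[j]] += 1
        best, best_count = column.most_common(1)[0]
        # Only change the base when another allele strictly outvotes it.
        corrected.append(best if best_count > column[base] else base)
    return "".join(corrected)


query = "ACGTTCGA"  # position 4 carries a substitution error
overlaps = [(0, "ACGTACGA"), (2, "GTACGA"), (-2, "TTACGTAC")]
print(correct_by_majority(query, overlaps))
```

Each column here plays the role of one locus in the MSA matrix: the three overlapping reads agree on `A` at position 4, so the query's `T` is replaced.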

Summary

Introduction

High-throughput sequencing offers higher throughput and lower cost for sequencing a genome. The reads generated by next-generation sequencing platforms (e.g., Illumina, Roche 454) may contain several types of errors, including mismatches, insertions, and deletions (the latter two collectively termed indels) [1]. These errors pose great challenges to subsequent genome assembly algorithms, because false read overlaps may be produced, which in turn leads to fragmented assembly or even misassembly. They also increase the size of the assembly graph through erroneous vertices and edges, which implies greater memory usage and computation time.
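To make the graph-size argument concrete, the sketch below counts the k-mers that a single substitution adds relative to an error-free sequence; in a k-mer based assembly graph each such k-mer would become an extra vertex. The sequences, the value of k, and the function names `kmers` and `spurious_kmers` are illustrative assumptions, not data from the paper.

```python
def kmers(sequence, k):
    """Return the set of all substrings of length k in sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def spurious_kmers(reference, read, k):
    """Return k-mers present in the read but absent from the reference."""
    return kmers(read, k) - kmers(reference, k)


reference = "ATGGCGTACCTGA"
read = "ATGGCGAACCTGA"  # same sequence with one substitution at index 6 (T to A)
extra = spurious_kmers(reference, read, k=4)
print(sorted(extra))
```

A single substitution in the interior of a read can affect up to k overlapping k-mers, so even a low per-base error rate multiplies into many erroneous vertices across millions of reads.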

Journal: BMC Bioinformatics
Publication Date: Nov 28, 2017
Citations: 9
License type: open-access
