EC: an efficient error correction algorithm for short reads.

Subrata Saha, Sanguthevar Rajasekaran

doi:10.1186/1471-2105-16-s17-s2

Abstract

Background: In highly parallel next-generation sequencing (NGS) techniques, millions to billions of short reads are produced from a genomic sequence in a single run. Due to the limitations of NGS technologies, the reads may contain errors. The error rate can be reduced by trimming and by correcting the erroneous bases of the reads. Correction yields higher-quality data, and the computational complexity of many biological applications is greatly reduced if the reads are corrected first. We have developed a novel error correction algorithm called EC and compared it with four other state-of-the-art algorithms using both real and simulated sequencing reads.

Results: Our extensive experiments show that EC is an effective, scalable, and efficient error correction tool. The real reads used in our performance evaluation are Illumina-generated short reads of various lengths. The six experimental datasets are taken from the Sequence Read Archive (SRA) at NCBI. The simulated reads are obtained by picking substrings from random positions of reference genomes. To introduce errors, some of the bases of the simulated reads are changed to other bases with some probability.

Conclusions: Error correction is a vital problem in biology, especially for NGS data. In this paper we present a novel algorithm, called Error Corrector (EC), for correcting substitution errors in biological sequencing reads. We plan to investigate whether the techniques introduced in this paper can also handle insertion and deletion errors.

Software availability: The implementation is freely available for non-commercial purposes. It can be downloaded from: http://engr.uconn.edu/~rajasek/EC.zip.

Highlights

In sequencing technology, numerous small fragments are generated by shredding a DNA molecule at random positions.
In this article we propose an effective, efficient, and scalable error correction algorithm called EC (Error Corrector) to correct the errors introduced by next-generation sequencing (NGS) technologies.
Based on the techniques they use, error correction algorithms can be classified into three types: k-spectrum based, suffix tree/array based, and multiple sequence alignment (MSA) based.

Summary

Introduction

In sequencing technology, numerous small fragments are generated by shredding a DNA molecule at random positions. The coverage of the reads in some regions of the genome can be low, and the reads can be erroneous due to the limitations of NGS technologies. These events in turn produce gaps, and the resulting overlap graph splits into multiple disconnected components.

The first k-spectrum based error correction algorithm was built into the assembly tool Euler SR [1,2]. It uses a spectral alignment method: it deduces a spectrum of trusted (i.e., most probably true) k-mers from the input data and corrects each read so that it contains only k-mers from the spectrum. Quake [5] applies the same k-mer spectrum framework. It introduces quality values and rates of specific miscalls computed from each sequencing project. It calculates the weight of a k-mer as the weighted sum of all its instances, weighting each instance by the quality values assigned to its bases.
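The spectral alignment idea described above can be illustrated with a minimal Python sketch. It is not the Euler SR, Quake, or EC implementation. The function names, the fixed count threshold `min_count`, the brute-force single-substitution search, and the weighting of each k-mer instance by the product of per-base correctness probabilities are all assumptions made for illustration. Only the Phred relation between a quality score Q and the error probability 10^(-Q/10) is standard.

```python
from collections import Counter, defaultdict


def kmers(read, k):
    """Yield every length-k substring of a read, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def trusted_spectrum(reads, k, min_count):
    """Return the k-mers seen at least min_count times (assumed 'trusted')."""
    counts = Counter(km for read in reads for km in kmers(read, k))
    return {km for km, c in counts.items() if c >= min_count}


def untrusted_count(read, k, trusted):
    """Count the k-mers of a read that fall outside the trusted spectrum."""
    return sum(1 for km in kmers(read, k) if km not in trusted)


def correct_read(read, k, trusted):
    """Apply the single substitution that leaves the fewest untrusted k-mers.

    Brute force over all positions and bases; the read is returned
    unchanged when no substitution improves it.
    """
    best, best_bad = read, untrusted_count(read, k, trusted)
    for i, base in enumerate(read):
        if best_bad == 0:
            break  # every k-mer is already trusted
        for alt in "ACGT":
            if alt == base:
                continue
            candidate = read[:i] + alt + read[i + 1:]
            bad = untrusted_count(candidate, k, trusted)
            if bad < best_bad:
                best, best_bad = candidate, bad
    return best


def weighted_counts(reads_with_quals, k):
    """Quality-weighted k-mer counts (assumed weighting scheme).

    Each k-mer instance contributes the product of its per-base
    correctness probabilities, 1 - 10 ** (-Q / 10) for Phred score Q.
    """
    weights = defaultdict(float)
    for read, quals in reads_with_quals:
        p_ok = [1.0 - 10 ** (-q / 10) for q in quals]
        for i in range(len(read) - k + 1):
            w = 1.0
            for p in p_ok[i:i + k]:
                w *= p
            weights[read[i:i + k]] += w
    return dict(weights)


reads = ["ACGTTGCA"] * 3 + ["ACGTAGCA"]  # last read has one substitution
spectrum = trusted_spectrum(reads, k=4, min_count=2)
print(correct_read("ACGTAGCA", 4, spectrum))  # ACGTTGCA
```

In this toy input the erroneous read shares only its first 4-mer with the trusted spectrum, so a single substitution at the fifth base restores all five k-mers. With quality-weighted counts, a threshold on the weight would play the role that `min_count` plays here, so that k-mers supported mainly by low-quality bases are less likely to be trusted.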

Methods

Results

Discussion

Conclusion