In search of perfect reads

Soumitra Pal, Srinivas Aluru

doi:10.1186/1471-2105-16-s17-s7

Abstract

Background: Continued advances in next-generation short-read sequencing technologies are increasing throughput and read lengths, while driving down error rates. Taking advantage of the high coverage sampling used in many applications, several error correction algorithms have been developed to improve data quality further. However, correcting errors in high coverage sequence data requires significant computing resources.

Methods: We propose a different approach to handle erroneous sequence data. Presently, error rates of high-throughput platforms such as the Illumina HiSeq are within 1%. Moreover, the errors are not uniformly distributed across reads, and a large percentage of reads are indeed error-free. The ability to predict such perfect reads can significantly impact the run-time complexity of applications. We present a simple and fast method, based on k-spectrum analysis, to identify error-free reads. The filtration process that identifies and weeds out erroneous reads can be customized at several levels of stringency, depending on the needs of the downstream application.

Results: Our experiments show that if around 80% of the reads in a dataset are perfect, then our method retains almost 99.9% of them with a precision of more than 90%. Though filtering out the reads identified as erroneous by our method reduces the average coverage by about 7%, we found that the remaining reads provide as uniform a coverage as the original dataset. We demonstrate the effectiveness of our approach on an example downstream application: Reptile, an error correction algorithm that relies on collectively analyzing the reads in a dataset to identify and correct erroneous bases. When Reptile instead uses the reads predicted to be perfect by our method to correct the other reads, the overall accuracy improves further by up to 10%.

Conclusions: Thanks to continuous technological improvements, the coverage and accuracy of reads from dominant sequencing platforms have now reached an extent where we can envision simply filtering out reads with errors, thus making error correction less important. Our algorithm is a first attempt to propose and demonstrate this new paradigm. Moreover, our demonstration is applicable to any error correction algorithm as a downstream application; this in turn gives a new class of error correction algorithms as a by-product.
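The abstract does not spell out how the k-spectrum analysis decides that a read is error-free, so the following is only a minimal sketch of one common way such a filter could work, not the authors' algorithm. It assumes that a read is predicted perfect when every one of its k-mers occurs at least `min_count` times across the dataset. The function names, the value of `k`, and the `min_count` threshold are all hypothetical, and reverse complements are ignored for simplicity.

```python
from collections import Counter


def kmers(read, k):
    """Yield every length-k substring of a read, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def build_spectrum(reads, k):
    """Count how often each k-mer occurs across all reads (the k-spectrum)."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def predict_perfect(reads, k=4, min_count=2):
    """Split reads into (predicted_perfect, predicted_erroneous).

    Assumed rule (not taken from the paper): a read is predicted perfect
    only if each of its k-mers is seen at least min_count times overall.
    """
    spectrum = build_spectrum(reads, k)
    perfect, erroneous = [], []
    for read in reads:
        if all(spectrum[km] >= min_count for km in kmers(read, k)):
            perfect.append(read)
        else:
            erroneous.append(read)
    return perfect, erroneous


# Toy dataset: three identical reads plus one read with a substituted base.
reads = ["ACGTTGCA", "ACGTTGCA", "ACGTTGCA", "ACGATGCA"]
perfect, erroneous = predict_perfect(reads, k=4, min_count=2)
```

In this toy dataset the substituted base creates k-mers such as `ACGA` that occur only once, so the fourth read is the only one flagged as erroneous. Raising `min_count` would make the filter stricter, which is one way to picture the adjustable stringency mentioned in the abstract.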

Highlights

Continued advances in next-generation short-read sequencing technologies are increasing throughput and read lengths, while driving down error rates
The focus of this work is applications of high-throughput sequencing in which a single genome is sampled at high coverage, such as resequencing and de novo sequencing
Several error correction algorithms for haploid genomes have been developed, using k-spectrum [1,2,3,4], suffix trees [5,6,7], or multiple sequence alignments [8,9] to identify overlapping reads

Summary

Introduction

Continued advances in next-generation short-read sequencing technologies are increasing throughput and read lengths, while driving down error rates. The focus of this work is applications of high-throughput sequencing in which a single genome is sampled at high coverage, such as resequencing and de novo sequencing. In these cases, the infrequent occurrence of errors in reads, and the apparent lack of affinity of errors to any fixed genomic location, provide a way to detect and correct erroneous bases in reads. If the reads covering a specific genomic position can be identified and properly positioned relative to their locations of genomic occurrence, this layout can be used to infer the true base by majority vote and correct the others. For a detailed survey of error correction methods, see [10,11].
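The majority-vote idea described above can be illustrated with a small sketch. It assumes that the layout is already known, that is, each read comes with its start position on the genome. In practice that placement has to be inferred, which is the hard part and is not shown here. The function names and the `(start, read)` representation are hypothetical, only substitution errors are handled, and ties are resolved arbitrarily.

```python
from collections import Counter, defaultdict


def majority_consensus(layout):
    """Return {genomic position: most frequent base} for a read layout.

    layout is a list of (start, read) pairs, where start is the assumed
    0-based genomic position of the first base of the read.
    """
    columns = defaultdict(Counter)
    for start, read in layout:
        for offset, base in enumerate(read):
            columns[start + offset][base] += 1
    return {pos: votes.most_common(1)[0][0] for pos, votes in columns.items()}


def correct_reads(layout):
    """Rewrite every read so that each base matches the majority vote."""
    consensus = majority_consensus(layout)
    return [
        (start, "".join(consensus[start + i] for i in range(len(read))))
        for start, read in layout
    ]


# Three overlapping reads; the second has an error at genomic position 3
# (A instead of T), which the other two reads outvote.
layout = [(0, "ACGTAC"), (1, "CGAACG"), (2, "GTACGG")]
corrected = correct_reads(layout)
```

Here the erroneous read `CGAACG` is corrected to `CGTACG`, while the other two reads are left unchanged. With low coverage or clustered errors the vote can be wrong, which is why such schemes depend on the high coverage discussed above.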



Journal: BMC Bioinformatics	Publication Date: Dec 1, 2015
Citations: 4	License type: cc-by


