Abstract

Lighter is a fast, memory-efficient tool for correcting sequencing errors. Lighter avoids counting k-mers. Instead, it uses a pair of Bloom filters, one holding a sample of the input k-mers and the other holding k-mers likely to be correct. As long as the sampling fraction is adjusted in inverse proportion to the depth of sequencing, Bloom filter size can be held constant while maintaining near-constant accuracy. Lighter is parallelized, uses no secondary storage, and is both faster and more memory-efficient than competing approaches while achieving comparable accuracy.
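
The sampling step described above can be sketched in a few lines of Python. This is only an illustrative sketch, not Lighter's actual implementation: the `BloomFilter` class, the double-hashing scheme, and the helper names `sampling_fraction` and `sample_kmers` are all hypothetical, and the reference values (a fraction of 0.1 at 70-fold depth) are assumptions chosen for the example. What it shows is why the filter size can stay fixed: if the fraction scales as 1/depth, the expected number of sampled k-mer occurrences stays roughly constant.

```python
import hashlib
import random


class BloomFilter:
    """Minimal Bloom filter over strings (illustrative only)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by double hashing one SHA-256 digest.
        digest = hashlib.sha256(item.encode("ascii")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            (self.bits[pos // 8] >> (pos % 8)) & 1
            for pos in self._positions(item)
        )


def sampling_fraction(depth, reference_depth=70.0, reference_fraction=0.1):
    """Scale the sampling fraction in inverse proportion to depth.

    ASSUMPTION: the reference values are placeholders for this sketch.
    """
    return min(1.0, reference_fraction * reference_depth / depth)


def sample_kmers(reads, k, fraction, bloom, rng):
    """Insert each k-mer occurrence into `bloom` with probability `fraction`."""
    for read in reads:
        for i in range(len(read) - k + 1):
            if rng.random() < fraction:
                bloom.add(read[i:i + k])


# Usage: sample k-mers from two toy reads at an assumed 140-fold depth.
bloom_a = BloomFilter(num_bits=1 << 16, num_hashes=5)
sample_kmers(["ACGTACGTAC", "CGTACGTACG"], k=5,
             fraction=sampling_fraction(140.0), bloom=bloom_a,
             rng=random.Random(0))
```

Because a k-mer that occurs many times in the input gets many chances to be sampled, k-mers from the true genome are far more likely to land in the filter than k-mers created by rare sequencing errors.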

Highlights

  • The cost and throughput of DNA sequencing have improved rapidly in the past several years [1], with recent advances reducing the cost of sequencing a single human genome at 30-fold coverage to around $1,000 [2]

  • SHREC [5] and HiTEC [6] build a suffix index of the input reads and locate errors by finding instances where a substring is followed by a character less often than expected

  • We simulated a collection of reads from the reference genome for the K12 strain of Escherichia coli (NC_000913.2) using Mason v0.1.2 [24]

Summary

Introduction

The cost and throughput of DNA sequencing have improved rapidly in the past several years [1], with recent advances reducing the cost of sequencing a single human genome at 30-fold coverage to around $1,000 [2]. Lighter uses a simple test applied to each position of each read to compile a set of solid k-mers, stored in a second Bloom filter. When Lighter is deciding whether a position is trusted, if its quality score is less than or equal to min{t1, t2 − 1}, it is called untrusted regardless of how many of the overlapping k-mers appear in Bloom filter A.
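
The per-position test with its quality-score veto could look roughly like the sketch below. Only the veto (quality at or below min{t1, t2 − 1} means untrusted) comes from the text above. The function names, the `count_threshold` parameter, and the rule that a position is otherwise trusted when enough overlapping k-mers are found are assumptions made for illustration, and `t1` and `t2` are passed in as plain parameters because their derivation is not given here. Any container that supports `in` can stand in for Bloom filter A.

```python
def overlapping_kmers(read, pos, k):
    """Return every k-mer of `read` that covers the 0-based position `pos`."""
    first = max(0, pos - k + 1)
    last = min(pos, len(read) - k)
    return [read[i:i + k] for i in range(first, last + 1)]


def is_trusted(quality, kmers, bloom_a, count_threshold, t1, t2):
    """Decide whether a read position is trusted (hypothetical sketch)."""
    # Veto from the text: a low quality score overrides any k-mer evidence.
    if quality <= min(t1, t2 - 1):
        return False
    # ASSUMPTION: otherwise, trust the position when at least
    # `count_threshold` of its overlapping k-mers are found in filter A.
    hits = sum(1 for kmer in kmers if kmer in bloom_a)
    return hits >= count_threshold


# Usage: a plain set stands in for Bloom filter A.
filter_a = {"ACG", "CGT", "GTA"}
kmers = overlapping_kmers("ACGTAC", pos=2, k=3)
trusted = is_trusted(quality=30, kmers=kmers, bloom_a=filter_a,
                     count_threshold=2, t1=5, t2=10)
```

The veto is checked first so that a base with a very low quality score can never be marked trusted, even if all of its overlapping k-mers happen to be in the filter.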
