Lighter: fast and memory-efficient sequencing error correction without counting

Li Song, Ben Langmead, Liliana Florea

doi:10.1186/preaccept-9663167051308943

Open Access

Abstract

Lighter is a fast, memory-efficient tool for correcting sequencing errors. Lighter avoids counting k-mers. Instead, it uses a pair of Bloom filters, one holding a sample of the input k-mers and the other holding k-mers likely to be correct. As long as the sampling fraction is adjusted in inverse proportion to the depth of sequencing, Bloom filter size can be held constant while maintaining near-constant accuracy. Lighter is parallelized, uses no secondary storage, and is both faster and more memory-efficient than competing approaches while achieving comparable accuracy.
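
The sampling idea in the abstract can be illustrated with a minimal Python sketch. It is not Lighter's implementation: the `BloomFilter` class, the double-hashing scheme, the helper names, and the `scale` constant in `sampling_fraction` are all assumptions made for illustration. Only the first of the two filters (the one holding a sample of the input k-mers) is shown.

```python
import hashlib
import random


class BloomFilter:
    """Toy Bloom filter over strings (illustrative, not Lighter's code)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions from one digest by double hashing.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        return all(self.bits[p >> 3] & (1 << (p & 7))
                   for p in self._positions(item))


def kmers(read, k):
    """Yield every length-k substring of a read."""
    return (read[i:i + k] for i in range(len(read) - k + 1))


def sampling_fraction(depth, scale=7.0):
    """Sampling fraction inversely proportional to sequencing depth.

    `scale` is a hypothetical constant chosen only for this example.
    """
    return min(1.0, scale / depth)


def sample_kmers(reads, k, alpha, bloom, rng):
    """Insert each k-mer occurrence into `bloom` with probability alpha."""
    for read in reads:
        for kmer in kmers(read, k):
            if rng.random() < alpha:
                bloom.add(kmer)
```

In this sketch, doubling the depth halves the sampling fraction, so the expected number of sampled k-mer occurrences stays roughly the same. That is the intuition behind holding the Bloom filter size constant as depth grows.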

Highlights

The cost and throughput of DNA sequencing have improved rapidly in the past several years [1], with recent advances reducing the cost of sequencing a single human genome at 30-fold coverage to around $1,000 [2].
SHREC [5] and HiTEC [6] build a suffix index of the input reads and locate errors by finding instances where a substring is followed by a character less often than expected.
We simulated a collection of reads from the reference genome for the K12 strain of Escherichia coli (NC_000913.2) using Mason v0.1.2 [24].

Summary

Introduction

The cost and throughput of DNA sequencing have improved rapidly in the past several years [1], with recent advances reducing the cost of sequencing a single human genome at 30-fold coverage to around $1,000 [2]. Lighter applies a simple test to each position of each read to compile a set of solid k-mers, which it stores in a second Bloom filter. When Lighter decides whether a position is trusted, a position whose quality score is less than or equal to min{t1, t2 − 1} is called untrusted regardless of how many of the overlapping k-mers appear in Bloom filter A, the filter holding the sampled k-mers.
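
The per-position test can be sketched in Python as follows. This is a simplified illustration rather than Lighter's actual procedure. The quality cutoff min{t1, t2 − 1} comes from the text above, with t1 and t2 taken as given parameters. The function names and the fixed `min_hits` threshold on the number of overlapping k-mers found in Bloom filter A are assumptions introduced here. Any container supporting `in` (a set below) can stand in for the filter.

```python
def overlapping_kmers(read, pos, k):
    """Return all k-mers of `read` that cover position `pos`."""
    first_start = max(0, pos - k + 1)
    last_start = min(pos, len(read) - k)
    return [read[s:s + k] for s in range(first_start, last_start + 1)]


def position_is_trusted(read, quals, pos, k, filter_a, t1, t2, min_hits):
    """Decide whether one read position is trusted.

    A low quality score vetoes the position outright; otherwise the
    position is trusted when enough overlapping k-mers are in filter A.
    `min_hits` is a hypothetical fixed threshold used for illustration.
    """
    if quals[pos] <= min(t1, t2 - 1):
        return False  # untrusted regardless of k-mer membership
    hits = sum(1 for kmer in overlapping_kmers(read, pos, k)
               if kmer in filter_a)
    return hits >= min_hits


# Example: every k-mer covering position 3 is present in the stand-in filter.
read = "ACGTACGT"
filter_a = {"ACGT", "CGTA", "GTAC", "TACG"}
quals = [30] * len(read)
trusted = position_is_trusted(read, quals, 3, 4, filter_a,
                              t1=5, t2=10, min_hits=3)
```

Lowering `quals[3]` to 5 in this example makes the position untrusted even though all four overlapping k-mers are present, which mirrors the veto described above.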

Results

Conclusion