Efficient counting of k-mers in DNA sequences using a bloom filter

Páll Melsted, Jonathan K Pritchard

doi:10.1186/1471-2105-12-333

Open Access

Abstract

Background: Counting k-mers (substrings of length k in DNA sequence data) is an essential component of many methods in bioinformatics, including for genome and transcriptome assembly, for metagenomic sequencing, and for error correction of sequence reads. Although simple in principle, counting k-mers in large modern sequence data sets can easily overwhelm the memory capacity of standard computers. In current data sets, a large fraction (often more than 50%) of the storage capacity may be spent on storing k-mers that contain sequencing errors and which are typically observed only a single time in the data. These singleton k-mers are uninformative for many algorithms without some kind of error correction.

Results: We present a new method that identifies all the k-mers that occur more than once in a DNA sequence data set. Our method does this using a Bloom filter, a probabilistic data structure that stores all the observed k-mers implicitly in memory with greatly reduced memory requirements. We then make a second sweep through the data to provide exact counts of all nonunique k-mers. For example data sets, we report up to 50% savings in memory usage compared to current software, with modest costs in computational speed. This approach may reduce memory requirements for any algorithm that starts by counting k-mers in sequence data with errors.

Conclusions: A reference implementation for this methodology, BFCounter, is written in C++ and is GPL licensed. It is available for free download at http://pritch.bsd.uchicago.edu/bfcounter.html
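The two-sweep scheme described in the Results can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the BFCounter implementation: the class `BloomFilter`, the function `count_repeated_kmers`, the BLAKE2-based double hashing, and the default filter size are all hypothetical choices made for the example. It also ignores reverse complements and assumes the reads fit in a list that can be traversed twice. Dropping entries whose exact count is 1 after the second sweep is one plausible way to discard k-mers that entered the table only through a Bloom filter false positive.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter; sizing and hashing are illustrative assumptions."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive all bit positions from two halves of one digest.
        digest = hashlib.blake2b(item.encode(), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))


def kmers(read, k):
    """Yield every substring of length k in a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def count_repeated_kmers(reads, k, num_bits=1 << 20, num_hashes=4):
    """Return exact counts of all k-mers that occur more than once in reads."""
    bloom = BloomFilter(num_bits, num_hashes)
    counts = {}

    # First sweep: a k-mer gets a table entry only once the filter reports
    # that it has been seen before, so most singletons never reach the table.
    for read in reads:
        for kmer in kmers(read, k):
            if kmer in bloom:
                counts[kmer] = 0
            else:
                bloom.add(kmer)

    # Second sweep: exact counts, restricted to k-mers already in the table.
    for read in reads:
        for kmer in kmers(read, k):
            if kmer in counts:
                counts[kmer] += 1

    # Entries with count 1 were admitted by a false positive; drop them.
    return {kmer: c for kmer, c in counts.items() if c > 1}
```

Because a Bloom filter never reports a false negative, every k-mer that truly occurs twice or more reaches the table during the first sweep, so the final counts are exact regardless of the filter's false positive rate. The filter size only affects how many singleton k-mers temporarily occupy table memory.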

Highlights

Counting k-mers is an essential component of many methods in bioinformatics, including for genome and transcriptome assembly, for metagenomic sequencing, and for error correction of sequence reads
The Bloom filter is a probabilistic data structure supporting dynamic set membership queries with false positives [16]
Counting k-mers from sequencing data is an essential component of many recent methods for genome
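How often a Bloom filter returns a false positive can be estimated with the standard textbook approximation p ≈ (1 − e^(−hn/m))^h for n inserted items, m bits, and h hash functions. The sketch below evaluates that approximation. The function names are hypothetical, the symbol h is used for the number of hash functions to avoid a clash with the k-mer length k, and the numbers in the usage note are arbitrary illustrations, not parameters reported in the paper.

```python
import math


def bloom_false_positive_rate(num_items, num_bits, num_hashes):
    """Approximate false positive rate after inserting num_items distinct items."""
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


def optimal_num_hashes(num_items, num_bits):
    """Number of hash functions that roughly minimizes the approximation above."""
    return max(1, round(num_bits / num_items * math.log(2)))
```

For instance, with 8 bits per item, `optimal_num_hashes(1000, 8000)` suggests 6 hash functions, and `bloom_false_positive_rate(1000, 8000, 6)` comes out at roughly 2%. A query for a k-mer that was never inserted would then be answered "present" about one time in fifty, while a k-mer that was inserted is always reported as present.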

Summary

Results

We present a new method that identifies all the k-mers that occur more than once in a DNA sequence data set. Our method does this using a Bloom filter, a probabilistic data structure that stores all the observed k-mers implicitly in memory with greatly reduced memory requirements. We report up to 50% savings in memory usage compared to current software, with modest costs in computational speed. This approach may reduce memory requirements for any algorithm that starts by counting k-mers in sequence data with errors.

Journal: BMC Bioinformatics
Publication Date: Aug 10, 2011
Citations: 278
License type: CC BY 2.0
