Abstract

Background: Specialized data structures are required for online algorithms to efficiently handle large sequencing datasets. The counting quotient filter (CQF), a compact hashtable, can efficiently store k-mers with a skewed distribution.

Results: Here, we present the mixed-counters quotient filter (MQF) as a new variant of the CQF with novel counting and labeling systems. The new counting system adapts to a wider range of data distributions for increased space efficiency and is faster than the CQF for insertions and queries in most of the tested scenarios. A buffered version of the MQF can offload storage to disk, trading speed of insertions and queries for a significant memory reduction. The labeling system provides a flexible framework for assigning labels to member items while maintaining good data locality and a concise memory representation. These labels serve as a minimal perfect hash function but are ~tenfold faster than BBhash, with no need to re-analyze the original data for further insertions or deletions.

Conclusions: The MQF is a flexible and efficient data structure that extends our ability to work with high throughput sequencing data.

Introduction

Specialized data structures are required for online algorithms to efficiently handle large sequencing datasets. Approximate data structures are commonly used in online algorithms to provide better average space and time efficiency [2]. The Bloom filter supports approximate set membership queries with a predefined false positive rate (FPR) [3]. The count-min sketch (CMS) is similar to the Bloom filter and can be used to count items with a tunable rate of overestimation. Bloom filters and the CMS have several drawbacks; in particular, they have poor data locality. The counting quotient filter (CQF), a compact hashtable, serves similar purposes but stores k-mers with a skewed distribution more efficiently and has much better data locality.
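
To make the overestimation and locality points concrete, the following is a minimal count-min sketch in Python. It is an illustrative sketch only, not the implementation evaluated in the paper: the class name `CountMinSketch`, the helper `kmers`, the default `width` and `depth`, and the choice of salted BLAKE2b as the hash family are all assumptions made for this example.

```python
import hashlib


class CountMinSketch:
    """Toy count-min sketch: `depth` rows of `width` counters (assumed sizes)."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One hash function per row, derived by salting BLAKE2b with the row number.
        digest = hashlib.blake2b(
            item.encode(), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        # Each insertion touches one counter in every row, at unrelated
        # positions; this scattered access is the source of poor data locality.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum over the rows
        # never falls below the true count (it may overestimate).
        return min(
            self.table[row][self._index(item, row)] for row in range(self.depth)
        )


def kmers(sequence, k):
    """Return all overlapping substrings of length k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


sketch = CountMinSketch()
for kmer in kmers("ACGTACGTAC", 4):
    sketch.add(kmer)

# "ACGT" occurs twice in the sequence, so the estimate is at least 2.
print(sketch.estimate("ACGT"))
```

Widening the rows reduces collisions and therefore overestimation, at the cost of memory. A quotient-filter design instead keeps the data for an item in one contiguous region of a single table, which is where the locality advantage described above comes from.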
