MQF and buffered MQF: quotient filters for efficient storage of k-mers with their counts and metadata

Moustafa Shokrof, C Titus Brown, Tamer A Mansour

doi:10.1186/s12859-021-03996-x

Open Access

https://doi.org/10.1186/s12859-021-03996-x

Abstract

Background: Specialized data structures are required for online algorithms to efficiently handle large sequencing datasets. The counting quotient filter (CQF), a compact hashtable, can efficiently store k-mers with a skewed distribution.

Results: Here, we present the mixed-counters quotient filter (MQF) as a new variant of the CQF with novel counting and labeling systems. The new counting system adapts to a wider range of data distributions for increased space efficiency and is faster than the CQF for insertions and queries in most of the tested scenarios. A buffered version of the MQF can offload storage to disk, trading speed of insertions and queries for a significant memory reduction. The labeling system provides a flexible framework for assigning labels to member items while maintaining good data locality and a concise memory representation. These labels serve as a minimal perfect hash function but are ~tenfold faster than BBhash, with no need to re-analyze the original data for further insertions or deletions.

Conclusions: The MQF is a flexible and efficient data structure that extends our ability to work with high throughput sequencing data.

Highlights

Specialized data structures are required for online algorithms to efficiently handle large sequencing datasets
The mixed-counters quotient filter (MQF) is a flexible and efficient data structure that extends our ability to work with high throughput sequencing data
The Bloom filter supports approximate set membership queries with a predefined false positive rate (FPR) [3]

Summary

Introduction

Specialized data structures are required for online algorithms to efficiently handle large sequencing datasets. The counting quotient filter (CQF), a compact hashtable, can efficiently store k-mers with a skewed distribution. Approximate data structures are commonly used in online algorithms to provide better average space and time efficiency [2]. The Bloom filter supports approximate set membership queries with a predefined false positive rate (FPR) [3]. The count-min sketch (CMS) is similar to the Bloom filter and can be used to count items with a tunable rate of overestimation. Both Bloom filters and the CMS have drawbacks; in particular, they have poor data locality. The CQF serves similar purposes but is more efficient for skewed distributions and has much better data locality.
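To make the idea of counting with a tunable rate of overestimation concrete, here is a minimal sketch of a count-min sketch applied to k-mers. It is an illustration only and is not taken from the MQF or CQF code: the names `kmers`, `CountMinSketch`, `width` and `depth`, and the choice of salted BLAKE2b as the hash family, are all assumptions made for this example.

```python
import hashlib


def kmers(seq, k):
    """Yield every overlapping substring of length k from seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


class CountMinSketch:
    """Hypothetical CMS: `depth` rows of `width` counters, one hash per row."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _slots(self, item):
        # Derive one hash function per row by salting BLAKE2b with the row
        # index (an assumption of this sketch, not a requirement of the CMS).
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=row.to_bytes(8, "little")
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for row, col in self._slots(item):
            self.rows[row][col] += count

    def estimate(self, item):
        # Collisions can only add to a counter, so every row overestimates;
        # the minimum over rows is the least inflated of those values.
        return min(self.rows[row][col] for row, col in self._slots(item))


sketch = CountMinSketch(width=64, depth=4)
for kmer in kmers("ACGTACGTACGAACGT", 4):
    sketch.add(kmer)
print(sketch.estimate("ACGT"))  # never below the true count of "ACGT"
```

Increasing `width` lowers the chance that two k-mers share a counter, and increasing `depth` gives more rows to take the minimum over; these two parameters are what make the overestimation tunable. Each insertion or query touches one counter in every row at unrelated positions, which is the scattered access pattern behind the data-locality problem mentioned above.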

Methods

Results

Discussion

Conclusion