KmcEx: memory-frugal and retrieval-efficient encoding of counted k-mers.

Peng Jiang,Xiangjun Tang,Bertil Schmidt,Ningjiang Chen,Jie Luo,Liang Zhao,Limsoon Wong,Yiqi Wang,Pingji Deng,Bonnie Berger

doi:10.1093/bioinformatics/btz299

Abstract

K-mers along with their frequency have served as an elementary building block for error correction, repeat detection, multiple sequence alignment, genome assembly, etc., attracting intensive studies in k-mer counting. However, the output of k-mer counters itself is large; very often, it is too large to fit into main memory, leading to highly narrowed usability. We introduce a novel idea of encoding k-mers as well as their frequency, achieving good memory saving and retrieval efficiency. Specifically, we propose a Bloom filter-like data structure to encode counted k-mers by coupled-bit arrays-one for k-mer representation and the other for frequency encoding. Experiments on five real datasets show that the average memory-saving ratio on all 31-mers is as high as 13.81 as compared with raw input, with 7 hash functions. At the same time, the retrieval time complexity is well controlled (effectively constant), and the false-positive rate is decreased by two orders of magnitude. The source codes of our algorithm are available at github.com/lzhLab/kmcEx. Supplementary data are available at Bioinformatics online.

Full Text