Gerbil: a fast and memory-efficient k-mer counter with GPU-support

Marius Erbert, Matthias Müller-Hannemann, Steffen Rechner

doi:10.1186/s13015-017-0097-9

Open Access

https://doi.org/10.1186/s13015-017-0097-9

Abstract

Background: A basic task in bioinformatics is the counting of k-mers in genome sequences. Existing k-mer counting tools are most often optimized for small k < 32 and suffer from excessive memory resource consumption or degrading performance for large k. However, given the technology trend towards long reads of next-generation sequencers, support for large k becomes increasingly important.

Results: We present the open source k-mer counting software Gerbil that has been designed for the efficient counting of k-mers for k ≥ 32. Our software is the result of an intensive process of algorithm engineering. It implements a two-step approach. In the first step, genome reads are loaded from disk and redistributed to temporary files. In a second step, the k-mers of each temporary file are counted via a hash table approach. In addition to its basic functionality, Gerbil can optionally use GPUs to accelerate the counting step. In a set of experiments with real-world genome data sets, we show that Gerbil is able to efficiently support both small and large k.

Conclusions: While Gerbil’s performance is comparable to existing state-of-the-art open source k-mer counting tools for small k < 32, it vastly outperforms its competitors for large k, thereby enabling new applications which require large values of k.
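The two-step approach described in the abstract can be illustrated with a minimal Python sketch. This is not Gerbil's implementation: the in-memory buckets stand in for temporary files, the CRC32-based bucket choice is an assumption made only so that identical k-mers land in the same bucket, and all function names (`kmers`, `distribute`, `count_buckets`, `count_kmers`) are hypothetical. The sketch also counts k-mers as plain strings and ignores GPU support.

```python
from collections import Counter
import zlib


def kmers(read, k):
    """Yield every length-k substring of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def distribute(reads, k, num_buckets):
    """Step 1: spread k-mers over buckets (in-memory stand-ins for temporary files)."""
    buckets = [[] for _ in range(num_buckets)]
    for read in reads:
        for kmer in kmers(read, k):
            # Assumption: a CRC32 of the k-mer selects the bucket, so equal
            # k-mers always end up in the same bucket.
            idx = zlib.crc32(kmer.encode("ascii")) % num_buckets
            buckets[idx].append(kmer)
    return buckets


def count_buckets(buckets):
    """Step 2: count each bucket independently with a hash table."""
    totals = Counter()
    for bucket in buckets:
        # Each bucket gets its own hash table; the results are then merged.
        totals.update(Counter(bucket))
    return totals


def count_kmers(reads, k, num_buckets=4):
    """Run both steps and return a mapping from k-mer to count."""
    return count_buckets(distribute(reads, k, num_buckets))
```

Because equal k-mers always share a bucket, each bucket can be counted on its own, so only one bucket's hash table needs to be in memory at a time.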

Highlights

A basic task in bioinformatics is the counting of k-mers in genome sequences
In the main part of this article, we focus on algorithm engineering aspects that proved essential for high performance and describe details, like the integration of a GPU into the counting process
Experimental setup: we tested our implementation in a set of experiments

Summary

Results

For each of our test data sets we counted the k-mers for a set of different k and compared Gerbil’s running time with those of KMC2 in version 2.3.0 and DSK in version 2.0.7. We used a synthesized test set GRCh38, created from Genome Reference Consortium Human Reference 38, from which we uniformly sampled k-mers of size 1000. The purpose of these data sets is to provide longer reads, which allows testing the performance for larger values of k.

Memory and disk space

A closer look at Table 4, which shows detailed information on the running time and memory usage of each tool, yields some additional insights. Gerbil typically uses much less memory due to its dynamic prediction of the hash table size. Both KMC2 and DSK use a significantly larger amount of main memory. Low disk space consumption is essential, since disk operations are far more expensive than the actual counting.

Conclusions

Background

Input: k-mer x