Disk-based k-mer counting on a PC

Sebastian Deorowicz, Agnieszka Debudaj-Grabysz, Szymon Grabowski

doi:10.1186/1471-2105-14-160


Open Access

https://doi.org/10.1186/1471-2105-14-160


Abstract

Background: The k-mer counting problem, which is to build the histogram of occurrences of every k-symbol long substring in a given text, is important for many bioinformatics applications. These include developing de Bruijn graph genome assemblers, fast multiple sequence alignment and repeat detection.

Results: We propose a simple, yet efficient, parallel disk-based algorithm for counting k-mers. Experiments show that it usually offers the fastest solution to the considered problem, while demanding a relatively small amount of memory. In particular, it is capable of counting the statistics for short-read human genome data, given as a gzipped FASTQ input file, in less than 40 minutes on a PC with 16 GB of RAM and 6 CPU cores, and for long-read human genome data in less than 70 minutes. On a more powerful machine, using 32 GB of RAM and 32 CPU cores, the tasks are accomplished in less than half the time. For most tested settings of this problem and mammalian-size data, no other algorithm can accomplish this task in comparable time. Our solution is also among the memory-frugal ones; most competing algorithms cannot work efficiently on a PC with 16 GB of memory for such massive data.

Conclusions: By making use of cheap disk space and exploiting CPU and I/O parallelism, we propose a very competitive k-mer counting procedure, called KMC. Our results suggest that judicious resource management may make it possible to solve at least some bioinformatics problems with massive data on a commodity personal computer.

Highlights

The k-mer counting problem, which is to build the histogram of occurrences of every k-symbol long substring in a given text, is important for many bioinformatics applications.
The most popular assembly approach for such data is based on building the de Bruijn graph [1], in which an edge between any pair of k-mers, represented as nodes in the graph, exists if and only if the (k − 1)-symbol long suffix of one k-mer is a prefix of the other.
The other two, Tallymer [3] and Meryl from the Celera assembler [5], were tested in [6] on a 1 GB turkey genome, and we can find the following statement in the cited work: "Jellyfish is able to count 22-mers at coverage > 10× where the other programs fail or take over 5 h." This makes them hard to use on human genome data with 30-fold coverage.
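
The edge rule for the de Bruijn graph stated in the highlights can be illustrated with a short Python sketch. It is only a toy, in-memory illustration of the definition, not how any of the cited assemblers build their graphs; the function name `de_bruijn_edges` is hypothetical and introduced here for the example.

```python
from collections import defaultdict


def de_bruijn_edges(kmers):
    """Return directed edges (u, v) between k-mers, where the (k - 1)-symbol
    suffix of u equals the (k - 1)-symbol prefix of v.

    Assumption: all k-mers have the same length k >= 2.
    """
    nodes = sorted(set(kmers))  # deduplicate; sort for a deterministic order

    # Index the nodes by their (k - 1)-symbol prefix so that each suffix
    # lookup is a dictionary access instead of a scan over all nodes.
    by_prefix = defaultdict(list)
    for kmer in nodes:
        by_prefix[kmer[:-1]].append(kmer)

    edges = []
    for u in nodes:
        for v in by_prefix.get(u[1:], []):
            edges.append((u, v))
    return edges
```

For example, `de_bruijn_edges(["ACG", "CGT", "GTA"])` links `ACG` to `CGT` and `CGT` to `GTA`, because `CG` and `GT` are the shared (k − 1)-symbol overlaps.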

Summary

Introduction

Counting the number of occurrences of every substring of length k (a so-called k-mer) in a given string S is an important procedure in bioinformatics. Current sequencing technology cannot eliminate a relatively large number of errors (misdetected nucleotides) in sequence reads. These errors can be detected on a statistical basis. We should not distinguish between a k-mer and its reverse complement, and by the "canonical k-mer" we will mean the lexicographically smaller of the two.
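
As a minimal sketch of the definitions above, the following Python code counts canonical k-mers in a list of reads and builds the histogram of occurrence counts. It is a naive in-memory illustration, not the disk-based KMC algorithm; the function names are hypothetical, and skipping k-mers that contain symbols other than A, C, G, T is an assumption made for this example.

```python
from collections import Counter

# Translation table mapping each nucleotide to its complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_DNA = set("ACGT")


def reverse_complement(kmer):
    """Complement every nucleotide, then reverse the string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = reverse_complement(kmer)
    return kmer if kmer <= rc else rc


def count_canonical_kmers(reads, k):
    """Count canonical k-mers over all reads (everything is held in memory)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Assumption: k-mers with symbols outside ACGT (e.g. N) are skipped.
            if not set(kmer) <= _DNA:
                continue
            counts[canonical(kmer)] += 1
    return counts


def occurrence_histogram(counts):
    """Map each occurrence count c to the number of distinct k-mers seen c times."""
    return Counter(counts.values())
```

For the single read `ACGTT` and k = 3, the k-mers `ACG` and `CGT` are reverse complements of each other and are therefore counted together under `ACG`, while `GTT` is counted under its reverse complement `AAC`. K-mers with very low counts in such a histogram are the ones typically suspected of containing sequencing errors.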

Objectives

Methods

Results