A general near-exact k-mer counting method with low memory consumption enables de novo assembly of 106× human sequence data in 2.7 hours.

Christina Huan Shi,Kevin Y Yip

doi:10.1093/bioinformatics/btaa890

Abstract

In de novo sequence assembly, a standard pre-processing step is k-mer counting, which computes the number of occurrences of every length-k sub-sequence in the sequencing reads. Sequencing errors can produce many k-mers that do not appear in the genome, leading to the need for an excessive amount of memory during counting. This issue is particularly serious when the genome to be assembled is large, the sequencing depth is high, or when the memory available is limited. Here, we propose a fast near-exact k-mer counting method, CQF-deNoise, which has a module for dynamically removing noisy false k-mers. It automatically determines the suitable time and number of rounds of noise removal according to a user-specified wrong removal rate. We tested CQF-deNoise comprehensively using data generated from a diverse set of genomes with various data properties, and found that the memory consumed was almost constant regardless of the sequencing errors while the noise removal procedurehad minimal effects on counting accuracy. Compared with four state-of-the-art k-mer counting methods, CQF-deNoise consistently performed the best in terms of memory usage, consuming 49-76% less memory than thesecond best method. When counting the k-mers from a human dataset with around 60× coverage, the peakmemory usage of CQF-deNoise was only 10.9GB (gigabytes) for k = 28 and 21.5GB for k = 55. De novo assembly of 106× human sequencing data using CQF-deNoise for k-mer counting required only 2.7 h and 90GB peak memory. The source codes of CQF-deNoise and SH-assembly are available at https://github.com/Christina-hshi/CQF-deNoise.git and https://github.com/Christina-hshi/SH-assembly.git, respectively, both under the BSD 3-Clause license.

Full Text