KmerCo: A lightweight K-mer counting technique with a tiny memory footprint

Sabuzima Nayak, Ripon Patgiri

doi:10.48550/arxiv.2305.07545

Abstract

K-mer counting is a prerequisite step in DNA assembly because it speeds up the overall process. The frequency of K-mers is used to estimate the parameters of DNA assembly, to correct errors, and so on. The process also provides a list of distinct K-mers, which assists in searching large databases and in reducing the size of de Bruijn graphs. Nonetheless, K-mer counting is a data- and compute-intensive process. Hence, it is crucial to implement a lightweight data structure that occupies little memory yet processes K-mers quickly. We propose a lightweight K-mer counting technique, called KmerCo, that implements a potent counting Bloom Filter variant, called countBF. KmerCo has two phases: insertion and classification. The insertion phase inserts all K-mers into countBF and determines the distinct K-mers. The classification phase classifies the distinct K-mers into trustworthy and erroneous K-mers based on a user-provided threshold value. We also propose a novel benchmark performance metric. We used a Hadoop MapReduce program to determine the frequency of K-mers. We conducted rigorous experiments to demonstrate the superiority of KmerCo over state-of-the-art K-mer counting techniques. The experiments use DNA sequences of four organisms. The datasets are pruned to generate four datasets of different sizes. KmerCo is compared with Squeakr, BFCounter, and Jellyfish. Compared with these three methods, KmerCo used the least memory, achieved the highest number of insertions per second, and attained a positive trustworthy rate.
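The two phases described above can be illustrated with a minimal Python sketch. It makes several assumptions: a plain counting Bloom filter stands in for countBF, whose internal layout is not given in this abstract, and the class, the function names, the hashing scheme, and the `>=` threshold comparison are all hypothetical choices made for illustration.

```python
import hashlib


class CountingBloomFilter:
    """Plain counting Bloom filter; a stand-in, NOT the paper's countBF."""

    def __init__(self, num_counters=1 << 16, num_hashes=3):
        self.num_counters = num_counters
        self.num_hashes = num_hashes
        self.counters = [0] * num_counters

    def _slots(self, item):
        # Derive num_hashes counter indices from salted BLAKE2b digests.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_counters

    def add(self, item):
        for slot in self._slots(item):
            self.counters[slot] += 1

    def count(self, item):
        # Minimum over the slots: never undercounts, may overcount on collisions.
        return min(self.counters[slot] for slot in self._slots(item))


def kmers(sequence, k):
    """Yield every overlapping substring of length k."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def insert_phase(sequence, k, cbf):
    """Insert all K-mers; record a K-mer as distinct when first seen."""
    distinct = []
    for kmer in kmers(sequence, k):
        if cbf.count(kmer) == 0:  # a false positive here would hide a new K-mer
            distinct.append(kmer)
        cbf.add(kmer)
    return distinct


def classify_phase(distinct, cbf, threshold):
    """Split distinct K-mers by estimated frequency (assumed rule: >= threshold)."""
    trustworthy = [x for x in distinct if cbf.count(x) >= threshold]
    erroneous = [x for x in distinct if cbf.count(x) < threshold]
    return trustworthy, erroneous


cbf = CountingBloomFilter()
distinct = insert_phase("ACGTACGTAC", k=4, cbf=cbf)
trustworthy, erroneous = classify_phase(distinct, cbf, threshold=2)
print(distinct, trustworthy, erroneous)
```

Because the filter only estimates counts, hash collisions can inflate a frequency or cause a new K-mer to be missed in the distinct list. The sketch makes no attempt to reproduce the memory or speed characteristics reported for KmerCo.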

Full Text