A randomized optimal k-mer indexing approach for efficient parallel genome sequence compression

Subhankar Roy, Anirban Mukhopadhyay

doi:10.1016/j.gene.2024.148235

Abstract

Next Generation Sequencing (NGS) technology generates massive amounts of genome sequence data, and the volume is growing rapidly. As a result, there is a growing need for efficient compression algorithms to facilitate the processing, storage, transmission, and analysis of large-scale genome sequences. Over the past 31 years, numerous state-of-the-art compression algorithms have been developed. The performance of a compression algorithm is measured by three main metrics: compression ratio, time, and memory usage. Existing k-mer hash indexing schemes are slower because they select the k-mer length based on compression results. In this paper, we propose a two-phase reference genome compression algorithm using optimal k-mer length (RGCOK). Reference-based compression takes advantage of the inter-similarity between chromosomes of the same species. RGCOK exploits this similarity by finding the optimal k-mer length for matching, using a randomization method and hashing. The performance of RGCOK was evaluated on three benchmark data sets (novel severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), Homo sapiens, and other species sequences) using an Amazon AWS virtual cloud machine. In the experiments, RGCOK found the optimal k-mer length in around 45.28 min, whereas the existing state-of-the-art algorithms HiRGC, SCCG, and HRCM took between 58 min and 8.97 h.
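
The abstract does not spell out how the randomization and hashing steps work, so the following is only an illustrative sketch of one plausible reading, not the RGCOK algorithm itself. It assumes that each candidate k-mer length is scored by sampling random k-mers from the target sequence and looking them up in a hash index of the reference, and that the largest length whose sampled hit rate reaches a threshold is chosen. The function names, the `samples` and `min_hit_rate` parameters, and the selection rule are all hypothetical.

```python
import random


def build_kmer_index(reference, k):
    """Hash index: each k-mer in the reference maps to its start positions."""
    index = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)
    return index


def sampled_hit_rate(reference, target, k, samples, rng):
    """Fraction of randomly sampled target k-mers that occur in the reference."""
    last_start = len(target) - k
    if last_start < 0:
        return 0.0  # target is shorter than k, so nothing can match
    index = build_kmer_index(reference, k)
    hits = 0
    for _ in range(samples):
        start = rng.randint(0, last_start)  # inclusive on both ends
        if target[start:start + k] in index:
            hits += 1
    return hits / samples


def pick_kmer_length(reference, target, candidates,
                     samples=200, min_hit_rate=0.9, seed=0):
    """Assumed selection rule: the largest candidate k whose sampled hit rate
    reaches min_hit_rate. Returns None if no candidate qualifies."""
    rng = random.Random(seed)
    best = None
    for k in sorted(candidates):
        if sampled_hit_rate(reference, target, k, samples, rng) >= min_hit_rate:
            best = k
    return best


if __name__ == "__main__":
    reference = "ACGT" * 50
    print(pick_kmer_length(reference, reference, [4, 8, 12]))
```

Sampling keeps the cost of scoring each candidate length independent of the target size, which is the kind of saving the abstract attributes to avoiding decisions based on full compression results. A real implementation would also need to bound the memory of the index for chromosome-scale references.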

Full Text