On weighted k-mer dictionaries

Giulio Ermanno Pibiri

doi:10.1186/s13015-023-00226-2

Abstract

We consider the problem of representing a set of k-mers and their abundance counts, or weights, in compressed space so that assessing membership and retrieving the weight of a k-mer is efficient. The representation is called a weighted dictionary of k-mers and finds application in numerous tasks in Bioinformatics that usually count k-mers as a pre-processing step. In fact, k-mer counting tools produce very large outputs that may result in a severe bottleneck for subsequent processing. In this work we extend the recently introduced SSHash dictionary (Pibiri in Bioinformatics 38:185–194, 2022) to also compactly store the weights of the k-mers. From a technical perspective, we exploit the order of the k-mers represented in SSHash to encode runs of weights, hence allowing much better compression than the empirical entropy of the weights. We study the problem of reducing the number of runs in the weights to improve compression even further and give an optimal algorithm for this problem. Lastly, we corroborate our findings with experiments on real-world datasets and comparisons with competitive alternatives. To date, SSHash is the only k-mer dictionary that is exact, weighted, associative, fast, and small.
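
The idea of encoding runs of weights can be illustrated with a minimal sketch. This is not SSHash's actual encoding: the function names and the sample weights below are hypothetical, chosen only to show why a sequence with long runs of equal weights can cost less than an order-0 entropy coder would suggest, since the cost then tracks the number of runs rather than the number of k-mers.

```python
from collections import Counter
from itertools import groupby
from math import log2


def run_length_encode(weights):
    """Collapse consecutive equal weights into (weight, run_length) pairs."""
    return [(w, sum(1 for _ in group)) for w, group in groupby(weights)]


def run_length_decode(runs):
    """Expand (weight, run_length) pairs back into the original sequence."""
    return [w for w, length in runs for _ in range(length)]


def empirical_entropy_bits(weights):
    """Total bits for an order-0 entropy coder: n * H0(weights)."""
    n = len(weights)
    counts = Counter(weights)
    return -sum(c * log2(c / n) for c in counts.values())


# Hypothetical weights, listed in the order the k-mers are stored.
weights = [1] * 6 + [7] * 4 + [1] * 5 + [3] * 1

runs = run_length_encode(weights)
print(f"{len(weights)} weights, {len(runs)} runs")
print(f"order-0 entropy bound: {empirical_entropy_bits(weights):.2f} bits")
```

The entropy bound grows with the number of weights, whereas a run-based encoding only needs to describe each run once. Fewer runs therefore mean a smaller encoding, which is why reducing the number of runs is worth studying.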

Journal: Algorithms for Molecular Biology
Publication Date: Jun 17, 2023
Citations: 7
License type: CC BY 4.0

Similar Papers

Gerbil: a fast and memory-efficient k-mer counter with GPU-support
Marius Erbert ... Steffen Rechner
Algorithms for Molecular Biology | VOL. 12
31 Mar 2017

Hardware Accelerator for BLAST
Shizuka Ishikawa ... Asuka Tanaka
01 Sep 2012

NNFSRR: Nearest Neighbor Feature Selection and Redundancy Removal Method for Nearest Neighbor Search in Microarray Gene Expression Data
Rupali Bhartiya ... Gend Lal Prajapati
EAI Endorsed Transactions on Pervasive Health and Technology | VOL. 9
19 Sep 2023

Importance of data preprocessing in time series prediction using SARIMA: A case study
Amir Hossein Adineh ... Zahra Narimani
International Journal of Knowledge-based and Intelligent Engineering Systems | VOL. 24
18 Jan 2021
