Taxonomic classification with maximal exact matches in KATKA kernels and minimizer digests.

Dominika Draesslerová, Omar Ahmed, Travis Gagie, Jan Holub, Ben Langmead, Giovanni Manzini, Gonzalo Navarro

doi:10.4230/lipics.sea.2024.10

Abstract

For taxonomic classification, we are asked to index the genomes in a phylogenetic tree such that later, given a DNA read, we can quickly choose a small subtree likely to contain the genome from which that read was drawn. Although popular classifiers such as Kraken use k-mers, recent research indicates that using maximal exact matches (MEMs) can lead to better classifications. For example, we can
■ build an augmented FM-index over the genomes in the tree concatenated in left-to-right order;
■ for each MEM in a read, find the interval in the suffix array containing the starting positions of that MEM's occurrences in those genomes;
■ find the minimum and maximum values stored in that interval;
■ take the lowest common ancestor (LCA) of the genomes containing the characters at those positions.

This solution is practical, however, only when the total size of the genomes in the tree is fairly small. In this paper we consider applying the same solution to three lossily compressed representations of the genomes' concatenation:
■ a KATKA kernel, which discards characters that are not in the first or last occurrence of any k_max-tuple, for a parameter k_max;
■ a minimizer digest;
■ a KATKA kernel of a minimizer digest.

With a test dataset and these three representations of it, simulated reads and various parameter settings, we checked how many reads' longest MEMs occurred only in the sequences from which those reads were generated ("true positive" reads). For some parameter settings we achieved significant compression while only slightly decreasing the true-positive rate.
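The four indexing and query steps listed in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: a naively sorted suffix array stands in for the augmented FM-index, only the longest MEM of a read is used (rather than every MEM), and the toy genomes, the leaf names `g0` to `g2`, the `parent` map, and all function names are hypothetical choices made for illustration.

```python
def build_index(genomes):
    """Concatenate genomes in left-to-right leaf order and sort the suffixes.

    Assumption: a plain suffix array replaces the paper's augmented FM-index.
    """
    text = "".join(g + "$" for g in genomes)
    doc_of = []  # doc_of[i] = index of the genome containing text position i
    for d, g in enumerate(genomes):
        doc_of.extend([d] * (len(g) + 1))
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return text, sa, doc_of


def sa_interval(text, sa, pattern):
    """Return the half-open suffix-array interval of suffixes starting with pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:  # first suffix whose m-prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def longest_mem(text, sa, read):
    """Longest substring of the read that occurs in the text (naive search)."""
    best = ""
    for i in range(len(read)):
        j = i + len(best) + 1  # only try matches longer than the current best
        while j <= len(read):
            lo, hi = sa_interval(text, sa, read[i:j])
            if lo == hi:
                break
            best = read[i:j]
            j += 1
    return best


def lca(parent, a, b):
    """Lowest common ancestor in a tree given as a child -> parent map."""
    ancestors = set()
    while a is not None:
        ancestors.add(a)
        a = parent.get(a)
    while b not in ancestors:
        b = parent[b]
    return b


def classify(read, genomes, leaf_names, parent):
    text, sa, doc_of = build_index(genomes)
    mem = longest_mem(text, sa, read)
    if not mem:
        return None
    lo, hi = sa_interval(text, sa, mem)
    positions = sa[lo:hi]
    # Minimum and maximum starting positions give the leftmost and rightmost genomes.
    first, last = min(positions), max(positions)
    return lca(parent, leaf_names[doc_of[first]], leaf_names[doc_of[last]])


# Hypothetical toy data: three genomes under a tree root -> (A -> (g0, g1), g2).
genomes = ["ACGTACGTGA", "ACGTTTGACA", "GGCATCCATG"]
leaf_names = ["g0", "g1", "g2"]
parent = {"g0": "A", "g1": "A", "A": "root", "g2": "root", "root": None}

print(classify("TTTGAC", genomes, leaf_names, parent))
print(classify("ACGT", genomes, leaf_names, parent))
print(classify("CA", genomes, leaf_names, parent))
```

Because the genomes are concatenated in left-to-right leaf order, the LCA of the genomes holding the minimum and maximum positions covers every genome in which the match occurs. The sketch rebuilds the index on every call and scans the whole interval for its minimum and maximum, which is only reasonable for toy inputs.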

More From: LIPIcs : Leibniz international proceedings in informatics

Similar Papers

Finding maximal exact matches in graphs
Nicola Rizzo ... Veli Mäkinen
Algorithms for Molecular Biology | VOL. 19
11 Mar 2024

Extracting Maximal Exact Matches on GPU
Anas Abu-Doleh ... Kamer Kaya
01 May 2014

A practical algorithm for finding maximal exact matches in large sequence datasets using sparse suffix arrays
Zia Khan ... Joshua S Bloom
Bioinformatics | VOL. 25
23 Apr 2009

A genomic distance for assembly comparison based on compressed maximal exact matches.
Sara P Garcia ... Armando J Pinho
IEEE/ACM Transactions on Computational Biology and Bioinformatics | VOL. 10
01 May 2013
