An average-case efficient two-stage algorithm for enumerating all longest common substrings of minimum length between genome pairs.

Mattia Prosperi, Simone Marini, Christina Boucher

doi:10.1109/ichi61247.2024.00020

https://doi.org/10.1109/ichi61247.2024.00020

Abstract

An extension of the longest common substring (LCS) problem between two texts is the enumeration of all LCSs of a given minimum length k (ALCS-k), along with their positions in each text. In bioinformatics, an efficient solution to ALCS-k for very long texts (genomes or metagenomes) can help discover genetic signatures responsible for biological mechanisms. The ALCS-k problem has two additional requirements compared to the LCS problem: one is the minimum length k, and the other is that all common strings longer than k must be reported. We present an efficient, two-stage ALCS-k algorithm exploiting the spectrum of text substrings of length k (k-mers). Our approach yields a worst-case time complexity loglinear in the number of k-mers for the first stage, and an average-case time complexity loglinear in the number of common k-mers for the second stage (several orders of magnitude smaller than the total k-mer spectrum). The space complexity is linear in the first phase (disk-based), and on average linear in the second phase (disk- and memory-based). Tests performed on genomes of different organisms (including viruses, bacteria and animal chromosomes) show that run times are consistent with our theoretical estimates; further, comparisons with MUMmer4 show an asymptotic advantage with divergent genomes.
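To make the two-stage idea concrete, the following is a minimal in-memory sketch and not the authors' implementation. It assumes that "all LCSs of minimum length k" means all maximal exact matches of length at least k between the two texts. The first stage finds the positions of k-mers shared by both texts; the second stage extends each shared seed to the right, keeping only seeds that cannot be extended to the left. The function names `kmer_positions` and `common_substrings` are hypothetical, and a hash-based index stands in for the sorted, disk-based k-mer spectrum described in the abstract, so the complexity bounds stated there do not carry over to this sketch.

```python
from collections import defaultdict


def kmer_positions(text, k):
    """Map each k-mer of `text` to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(text) - k + 1):
        index[text[i:i + k]].append(i)
    return index


def common_substrings(a, b, k):
    """Return sorted (pos_a, pos_b, length) triples for every maximal
    common substring of `a` and `b` with length >= k.

    Illustrative only: everything is held in memory and indexed by
    hashing, unlike the disk-based approach the abstract describes.
    """
    # Stage 1: find all position pairs (seeds) where the texts share a k-mer.
    idx_a = kmer_positions(a, k)
    idx_b = kmer_positions(b, k)
    seeds = set()
    for kmer in idx_a.keys() & idx_b.keys():
        for i in idx_a[kmer]:
            for j in idx_b[kmer]:
                seeds.add((i, j))

    # Stage 2: extend each left-maximal seed to the right as far as it matches.
    results = []
    for i, j in sorted(seeds):
        if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
            # The match continues to the left, so an earlier seed on the
            # same diagonal already covers this one.
            continue
        length = k
        while (i + length < len(a) and j + length < len(b)
               and a[i + length] == b[j + length]):
            length += 1
        results.append((i, j, length))
    return results


if __name__ == "__main__":
    for pos_a, pos_b, length in common_substrings("ACGTACGT", "TTACGTAA", 3):
        print(pos_a, pos_b, "ACGTACGT"[pos_a:pos_a + length])
```

In this sketch the second stage touches only the shared seeds, which mirrors the observation in the abstract that the common k-mers are typically far fewer than the full k-mer spectrum. Highly repetitive k-mers would still produce many seed pairs here.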

Full Text