Despite the widespread adoption of k-mer-based methods in bioinformatics, understanding the influence of k-mer sizes remains a persistent challenge. Selecting an optimal k-mer size or employing multiple k-mer sizes is often arbitrary, application-specific, and fraught with computational complexities. Typically, the influence of k-mer size is obscured by the outputs of complex bioinformatics tasks, such as genome analysis, comparison, assembly, alignment, and error correction. However, it is frequently overlooked that every method is built on a well-defined k-mer-based object, such as Jaccard similarity, de Bruijn graphs, k-mer spectra, or Bray-Curtis dissimilarity. Although these objects offer a clearer perspective on the role of k-mer sizes, their dynamics with respect to k-mer size remain surprisingly elusive. This paper introduces a computational framework that generalizes the transition of k-mer-based objects across k-mer sizes, utilizing a novel substring index, the Prokrustean graph. The primary contribution of this framework is to compute quantities associated with k-mer-based objects for all k-mer sizes, where the computational complexity depends solely on the number of maximal repeats and is independent of the range of k-mer sizes. For example, counting vertices of compacted de Bruijn graphs across a range of k-mer sizes can be accomplished in mere seconds with our substring index constructed on a gigabase-sized read set. Additionally, we derive a space-efficient algorithm to extract the Prokrustean graph from the Burrows-Wheeler Transform. It becomes evident that modern substring indices, mostly based on longest common prefixes of suffix arrays, inherently face difficulties in exploring varying k-mer sizes due to their limitations in grouping co-occurring substrings. We have implemented four applications that utilize quantities critical in modern pangenomics and metagenomics.
The code for these applications and the construction algorithm is available at https://github.com/KoslickiLab/prokrustean.
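To make the notion of a k-mer-based quantity that varies with k concrete, the following minimal Python sketch computes two such quantities, the number of distinct k-mers and the Jaccard similarity of two sequences, by brute force for each k in a range. It is a naive baseline written for illustration only and is not the Prokrustean graph method: the function names (`kmers`, `distinct_kmer_count`, `jaccard`, `jaccard_by_k`) and the toy sequences are assumptions of this sketch, and its cost grows with the number of k values examined, which is the dependence the framework is designed to avoid.

```python
def kmers(seq: str, k: int) -> set[str]:
    """Return the set of distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def distinct_kmer_count(seq: str, k: int) -> int:
    """Count the distinct k-mers of seq for a single k."""
    return len(kmers(seq, k))


def jaccard(a: str, b: str, k: int) -> float:
    """Jaccard similarity of the k-mer sets of a and b for a single k."""
    set_a, set_b = kmers(a, k), kmers(b, k)
    union = set_a | set_b
    if not union:
        # Neither sequence has a k-mer of this length; treated as 0.0 here
        # (a convention chosen for this sketch).
        return 0.0
    return len(set_a & set_b) / len(union)


def jaccard_by_k(a: str, b: str, k_min: int, k_max: int) -> dict[int, float]:
    """Recompute the similarity from scratch for every k in [k_min, k_max]."""
    return {k: jaccard(a, b, k) for k in range(k_min, k_max + 1)}


if __name__ == "__main__":
    # Toy sequences that differ at a single position.
    first, second = "ACGTACGT", "ACGTTCGT"
    for k, value in jaccard_by_k(first, second, 1, 8).items():
        print(k, distinct_kmer_count(first, k), round(value, 3))
```

Even on this toy input the similarity changes with k, and each additional k requires a full pass over both sequences.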