Longest Common Extension Research Articles

The longest common extension (LCE) of two indices in a string is the length of the longest identical substrings starting at these two indices. The LCE problem asks to preprocess a string into a compact data structure that supports fast LCE queries. In this paper we generalize the LCE problem to trees and suggest a few applications of LCE in trees to tries and XML databases. Given a labeled and rooted tree T of size n, the goal is to preprocess T into a compact data structure that supports the following LCE queries between subpaths and subtrees in T. Let v1, v2, w1, and w2 be nodes of T such that w1 and w2 are descendants of v1 and v2, respectively.

• LCE_PP(v1, w1, v2, w2): (path–path LCE) return the longest common prefix of the paths v1↝w1 and v2↝w2.
• LCE_PT(v1, w1, v2): (path–tree LCE) return a maximal path–path LCE of the path v1↝w1 and any path from v2 to a descendant leaf.
• LCE_TT(v1, v2): (tree–tree LCE) return a maximal path–path LCE of any pair of paths from v1 and v2 to descendant leaves.

We present the first non-trivial bounds for supporting these queries. For LCE_PP queries, we present a linear-space solution with O(log* n) query time. For LCE_PT queries, we present a linear-space solution with O((log log n)^2) query time, and complement this with a lower bound showing that any path–tree LCE structure of size O(n polylog(n)) must necessarily use Ω(log log n) time to answer queries. For LCE_TT queries, we present a time-space trade-off that, given any parameter τ with 1 ≤ τ ≤ n, leads to an O(nτ)-space and O(n/τ)-query-time solution. All of these bounds hold on a standard unit-cost RAM model with logarithmic word size. This is complemented with a reduction from the set intersection problem, implying that a fast linear-space solution is not likely to exist.
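To make the definitions above concrete, here is a minimal Python sketch of the naive, preprocessing-free baseline: an LCE query on a string and a path–path LCE query on a tree. It is not the data structure from the paper. The function names, the parent-array tree encoding, and the choice to put labels on nodes and to return a length rather than the prefix itself are all assumptions made for illustration.

```python
def lce(s, i, j):
    """Length of the longest common prefix of s[i:] and s[j:] (naive scan)."""
    n = len(s)
    k = 0
    while i + k < n and j + k < n and s[i + k] == s[j + k]:
        k += 1
    return k


def path_labels(parent, label, v, w):
    """Node labels along the path from v down to its descendant w, inclusive.

    Assumed encoding: parent[x] is the parent of node x (None for the root)
    and label[x] is the label of node x.
    """
    out = []
    node = w
    while node != v:
        if node is None:
            raise ValueError("w is not a descendant of v")
        out.append(label[node])
        node = parent[node]
    out.append(label[v])
    out.reverse()
    return out


def lce_pp(parent, label, v1, w1, v2, w2):
    """Naive path-path LCE: length of the common prefix of v1~>w1 and v2~>w2."""
    p = path_labels(parent, label, v1, w1)
    q = path_labels(parent, label, v2, w2)
    k = 0
    while k < len(p) and k < len(q) and p[k] == q[k]:
        k += 1
    return k
```

Both queries take time proportional to the path lengths involved, which is the cost the linear-space structures in the abstract are designed to avoid.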

Background: Chaos Game Representation (CGR) is an iterated function that bijectively maps discrete sequences into a continuous domain. As a result, discrete sequences can be the object of statistical and topological analyses otherwise reserved for numerical systems. Characteristically, the CGR coordinates of substrings sharing an L-long suffix will be located within a distance of 2^-L of each other. In the two decades since its original proposal, CGR has been generalized beyond its original focus on genomic sequences and has been successfully applied to a wide range of problems in bioinformatics. This report explores the possibility that it can be further extended to approach algorithms that rely on discrete, graph-based representations.

Results: The exploratory analysis described here consisted of selecting foundational string problems and refactoring them using CGR-based algorithms. We found that CGR can take the role of suffix trees and emulate sophisticated string algorithms, efficiently solving exact and approximate string matching problems such as finding all palindromes and tandem repeats, and matching with mismatches. The common feature of these problems is that they use longest common extension (LCE) queries as subtasks of their procedures, which we show to have a constant-time solution with CGR. Additionally, we show that CGR can be used as a rolling hash function within the Rabin-Karp algorithm.

Conclusions: The analysis of biological sequences relies on algorithmic foundations facing mounting challenges, both logistic (performance) and analytical (lack of a unifying mathematical framework). CGR is found to provide the latter and to promise the former: graph-based data structures for sequence analysis operations are entailed by numerical-based data structures produced by CGR maps, providing a unifying analytical framework for a diversity of pattern matching problems.
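The suffix property mentioned in the background can be illustrated with a short Python sketch of the CGR iteration for DNA. The corner assignment, the starting point (0.5, 0.5), and the use of the maximum-coordinate (Chebyshev) distance are assumptions chosen for illustration; conventions vary, and this is not the article's implementation.

```python
# Assumed corner assignment for the four nucleotides on the unit square.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr(seq, start=(0.5, 0.5)):
    """CGR coordinate of seq: move halfway toward each symbol's corner in turn."""
    x, y = start
    for ch in seq:
        cx, cy = CORNERS[ch]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
    return x, y


def chebyshev(p, q):
    """Maximum coordinate-wise distance between two points."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))


# Two sequences that share the 5-symbol suffix "TTGCA".
a = cgr("ACGTTGCA")
b = cgr("TTTTTGCA")
print(chebyshev(a, b) <= 2 ** -5)  # each shared symbol halves the gap
```

Each step halves the difference between two trajectories that read the same symbol, so after L shared trailing symbols the coordinates differ by at most 2^-L in each axis.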


Articles published on Longest Common Extension

Augmented Thresholds for MONI.

On the longest common prefix of suffixes in an inverse Lyndon factorization and other properties

Faster Lyndon factorization algorithms for SLP and LZ78 compressed text

Longest common extensions in trees

Fast Algorithms for Local Similarity Queries in Two Sequences

Time–space trade-offs for longest common extensions

Pattern matching through Chaos Game Representation: bridging numerical and discrete data structures for biological sequence analysis

The longest common extension problem revisited and applications to approximate string searching

