Auxiliary Data Structures Research Articles

This paper re-examines, in a unified framework, two classic approaches to the problem of finding a longest common subsequence (LCS) of two strings, and proposes faster implementations for both. Let l be the length of an LCS between two strings of length m and n ≥ m, respectively, and let s be the alphabet size. The first revised strategy follows the paradigm of a previous O(ln) time algorithm by Hirschberg. The new version can be implemented in time O(lm · min{log s, log m, log(2n/m)}), which is profitable when the input strings differ considerably in size (a looser bound for both versions is O(mn)). The second strategy improves on the Hunt-Szymanski algorithm. The latter takes time O((r + n) log n), where r ≤ mn is the total number of matches between the two input strings. Such a performance is quite good (O(n log n)) when r ∼ n, but it degrades to Θ(mn log n) in the worst case. The variation presented here, on the other hand, is never worse than linear in the product mn. The exact time bound derived for this second algorithm is O(m log n + d log(2mn/d)), where d ≤ r is the number of dominant matches (elsewhere referred to as minimal candidates) between the two strings. Both algorithms require an O(n log s) preprocessing that is nearly standard for the LCS problem, and they make use of simple and handy auxiliary data structures.
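For orientation, the baseline Hunt-Szymanski strategy that the abstract refers to can be sketched as follows. This is a minimal illustration of the classic O((r + n) log n) scheme, not the improved algorithm of the paper; the function name `lcs_length_hunt_szymanski` and the variable names are assumptions made for this sketch, and a sorted Python list with binary search stands in for whatever auxiliary structure an actual implementation would use.

```python
from bisect import bisect_left
from collections import defaultdict


def lcs_length_hunt_szymanski(a, b):
    """Return the LCS length of a and b using the classic match-list scheme."""
    # Preprocessing: for each symbol, the increasing list of its positions in b.
    occ = defaultdict(list)
    for j, ch in enumerate(b):
        occ[ch].append(j)

    # thresh[k] = smallest index in b at which a common subsequence
    # of length k + 1 (with the prefix of a scanned so far) can end.
    thresh = []
    for ch in a:
        # Visit the matches of this row in decreasing order of j so that
        # two matches from the same row never extend one another.
        for j in reversed(occ.get(ch, ())):
            k = bisect_left(thresh, j)
            if k == len(thresh):
                thresh.append(j)   # a longer common subsequence was found
            else:
                thresh[k] = j      # a smaller threshold for length k + 1
    return len(thresh)
```

Each of the r matches costs one binary search, which is where the r log n term comes from, and why the running time degrades when nearly every pair of positions matches.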


Among the algorithms devised to date for finding the longest common subsequence of two strings, the one by Hunt and Szymanski exhibits the best known performance in favorable cases, but can be worse than any straightforward algorithm for a large variety of inputs. The new algorithm presented here pursues a schedule of primitive operations quite close to the one inherent in the Hunt-Szymanski strategy, but with substantially enhanced efficiency. In fact, the new algorithm improves on the former in two important respects. First, its worst case is never worse than linear in the product mn of the lengths of the two input strings. Second, its time bound does not always grow with the cardinality r of the set R of all pairs of matching positions of the input strings. Rather, it depends on the cardinality d of a specific subset of R, whose elements are called here dominant matches, and are elsewhere referred to as minimal candidates. This second improvement also appears significant, since it seems that whenever r gets too close to mn, this forces d to be linear in m. The new algorithm requires standard preprocessing, and makes use of finger-trees. In a forthcoming paper, it will be shown, among other things, that the same performance can be achieved with simpler and handier auxiliary data structures.
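The gap between r and d can be made concrete with a small sketch. The abstract does not define dominant matches, so the code below assumes the usual characterization: a match (i, j) is dominant when the LCS length of the prefixes ending at (i, j) is strictly larger than at both (i - 1, j) and (i, j - 1). The function name `count_matches` is hypothetical, and the quadratic table is used only for illustration; it is not how the algorithm in the abstract operates.

```python
def count_matches(a, b):
    """Return (r, d, l): all matches, dominant matches, and LCS length."""
    m, n = len(a), len(b)
    # L[i][j] = LCS length of a[:i] and b[:j] (plain dynamic programming).
    L = [[0] * (n + 1) for _ in range(m + 1)]
    r = d = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
                r += 1
                # Assumed definition: the match is dominant if no match of
                # the same rank lies above or to the left of it.
                if L[i][j] > L[i - 1][j] and L[i][j] > L[i][j - 1]:
                    d += 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return r, d, L[m][n]
```

For two copies of the string "aaaa", every pair of positions matches, so r = mn = 16, yet only the four diagonal matches are dominant. This is the extreme case of the observation that d stays linear in m as r approaches mn.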


Articles published on Auxiliary Data Structures

The longest common subsequence problem revisited

Improving the worst-case performance of the Hunt-Szymanski strategy for the longest common subsequence of two strings

ARTS: Accelerated Ray-Tracing System

Large Software Problems for Small Computers: An Example from Medical Imaging

Monte Carlo study of weighted percolation clusters relevant to the Potts models

The Theory and Practice of Constructing an Optimal Polyphase Sort
