Abstract

We propose three algorithms for string edit distance with duplications and contractions: an efficient general algorithm and two improvements that apply under certain constraints on the cost function. The new algorithms solve a more general variant of the problem and achieve better time complexities than previous algorithms. Our general algorithm is based on min-plus multiplication of square matrices and has time and space complexities of O(|Σ|MP(n)) and O(|Σ|n²), respectively, where |Σ| is the alphabet size, n is the length of the strings, and MP(n) is the time bound for computing the min-plus product of two n × n matrices (the best current bound is due to an algorithm by Chan).

For integer cost functions, the running time is further improved. In addition, this variant of the algorithm is online, in the sense that the input strings may be given letter by letter, and its time complexity bounds the processing time of the first n given letters. The acceleration is based on our efficient matrix-vector min-plus multiplication algorithm, intended for matrices and vectors in which the differences between adjacent entries are drawn from a finite integer interval D. For a chosen constant parameter, the algorithm first preprocesses an n × n matrix; it can then multiply the matrix by any given vector of length n. Under some discreteness assumptions, this matrix-vector min-plus multiplication algorithm applies to several problems from the domains of context-free grammar parsing and RNA folding and, in particular, implies the asymptotically fastest algorithm for single-strand RNA folding with discrete cost functions.

Finally, assuming a different constraint on the cost function, we present another version of the algorithm that exploits the run-length encoding of the strings and whose time and space complexities depend on the length of that encoding.
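To make the central primitive concrete, the following is a minimal sketch of min-plus multiplication, in which the usual sum of products is replaced by a minimum of sums. It is the naive cubic-time (matrix-matrix) and quadratic-time (matrix-vector) formulation, not the accelerated algorithms the abstract refers to; the function names `min_plus_matmul` and `min_plus_matvec` are illustrative and do not come from the paper.

```python
import math


def min_plus_matmul(A, B):
    """Naive min-plus product: C[i][j] = min over k of A[i][k] + B[k][j]."""
    rows, inner, cols = len(A), len(B), len(B[0])
    C = [[math.inf] * cols for _ in range(rows)]
    for i in range(rows):
        for k in range(inner):
            a_ik = A[i][k]
            for j in range(cols):
                candidate = a_ik + B[k][j]
                if candidate < C[i][j]:
                    C[i][j] = candidate
    return C


def min_plus_matvec(A, v):
    """Naive min-plus matrix-vector product: y[i] = min over k of A[i][k] + v[k]."""
    return [min(a + x for a, x in zip(row, v)) for row in A]


A = [[0, 2], [1, 0]]
B = [[0, 5], [3, 0]]
product = min_plus_matmul(A, B)
vector_product = min_plus_matvec(A, [4, 1])
```

Entries equal to `math.inf` can stand for "no valid derivation", since infinity is the identity element for `min`.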

Highlights

  • Comparing strings is a well-studied problem in computer science as well as in bioinformatics

  • A baseline dynamic-programming algorithm for Edit Distance with Duplications and Contractions (EDDC): we describe a dynamic programming (DP) algorithm implementing the recursive EDDC computation given by Equations 1 to 9, which is the basis for the improvements introduced later in this paper

  • An online algorithm for EDDC using min-plus matrix-vector multiplication for discrete cost functions: we present an EDDC algorithm that is based on the general algorithm and improves its time complexity by a factor of O(log³ log n)


Summary

Background

Comparing strings is a well-studied problem in computer science as well as in bioinformatics. The EDDC algorithm presented is based on a D-discrete matrix-vector min-plus multiplication algorithm we developed, which is generic and may be applied to other problems as well. In Lemma 1, we show that in this case, all matrix multiplications applied by Algorithm 2 are between D-discrete matrices, with respect to a certain integer interval D. This proof is similar to that of Masek and Paterson for simple edit distance [17]. Given strings s and t and an integer cost function for EDDC, all matrix multiplications applied by Algorithm 2 are over D-discrete matrices, where D = I_{a,b} is determined by the cost function via a = −max{del(α) | α ∈ Σ} and b = max{ins(α) | α ∈ Σ} + 1.
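As an illustration of the discreteness condition, the sketch below computes the bounds a and b from an integer cost function and checks whether a matrix has all adjacent-entry differences inside a given set of allowed values. Several points are assumptions rather than definitions taken from the paper: "adjacent" is read as horizontally and vertically neighboring entries, each difference is taken as the later entry minus the earlier one, and the interval I_{a,b} is read as the half-open integer range {a, ..., b − 1}. The names `discreteness_bounds` and `is_d_discrete`, and the example costs, are hypothetical.

```python
def discreteness_bounds(ins_cost, del_cost):
    """Return (a, b) with a = -max del(alpha) and b = max ins(alpha) + 1."""
    a = -max(del_cost.values())
    b = max(ins_cost.values()) + 1
    return a, b


def is_d_discrete(M, allowed):
    """Check that every adjacent-entry difference of M lies in `allowed`.

    Assumption: adjacency means right and down neighbors, and the
    difference is (neighbor - current entry).
    """
    rows = len(M)
    cols = len(M[0]) if rows else 0
    for i in range(rows):
        for j in range(cols):
            if j + 1 < cols and M[i][j + 1] - M[i][j] not in allowed:
                return False
            if i + 1 < rows and M[i + 1][j] - M[i][j] not in allowed:
                return False
    return True


# Hypothetical integer costs over a two-letter alphabet.
ins_cost = {"a": 1, "b": 2}
del_cost = {"a": 3, "b": 1}
a, b = discreteness_bounds(ins_cost, del_cost)

# Assumption: I_{a,b} is the half-open integer range from a up to b - 1.
D = range(a, b)
small_steps = is_d_discrete([[0, 1], [-1, 0]], D)
large_step = is_d_discrete([[0, 5]], D)
```

Because the differences are confined to a finite set, each row or column can be encoded by its first entry plus a short sequence of difference symbols, which is what makes a preprocessing-based speedup plausible.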

We give an algorithm for min-plus
Conclusions and discussion
Waterman M
13. Williams R
19. Gusfield D
