Linear Time Algorithms for Generalizations of the Longest Common Substring Problem

Michael Arnold, Enno Ohlebusch

doi:10.1007/s00453-009-9369-1

Abstract

In its simplest form, the longest common substring problem is to find a longest substring common to two or more strings. Using (generalized) suffix trees, this problem can be solved in linear time and space. A first generalization is the $k$-common substring problem: Given $m$ strings of total length $n$, for all $k$ with $2 \le k \le m$ simultaneously find a longest substring common to at least $k$ of the strings. It is known that the $k$-common substring problem can also be solved in $O(n)$ time (Hui in Proc. 3rd Annual Symposium on Combinatorial Pattern Matching, volume 644 of Lecture Notes in Computer Science, pp. 230–243, Springer, Berlin, 1992). A further generalization is the $k$-common repeated substring problem: Given $m$ strings $T^{(1)},T^{(2)},\ldots,T^{(m)}$ of total length $n$ and $m$ positive integers $x_{1},\ldots,x_{m}$, for all $k$ with $1 \le k \le m$ simultaneously find a longest string $\omega$ for which there are at least $k$ strings $T^{(i_{1})},T^{(i_{2})},\ldots,T^{(i_{k})}$ ($1 \le i_{1} < i_{2} < \cdots < i_{k} \le m$) such that $\omega$ occurs at least $x_{i_{j}}$ times in $T^{(i_{j})}$ for each $j$ with $1 \le j \le k$. (For $x_{1} = \cdots = x_{m} = 1$, we have the $k$-common substring problem.) In this paper, we present the first $O(n)$ time algorithm for the $k$-common repeated substring problem. Our solution is based on a new linear time algorithm for the $k$-common substring problem.
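To make the problem statement concrete, the following is a minimal brute-force sketch of the $k$-common repeated substring problem. It is not the linear time algorithm of the paper: it enumerates every substring and runs in polynomial rather than $O(n)$ time. The function names `count_occurrences` and `k_common_repeated_lengths` are hypothetical, and the sketch assumes that overlapping occurrences are counted separately.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, overlapping ones included (assumption)."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def k_common_repeated_lengths(strings, thresholds):
    """Brute-force illustration of the problem definition only.

    Returns a list whose entry k-1 is the length of a longest string w
    such that at least k of the input strings T(i) contain w at least
    thresholds[i] times (0 if no such non-empty string exists).
    """
    m = len(strings)
    best = [0] * (m + 1)  # best[k] for k = 1..m; index 0 is unused
    # Every candidate w must be a non-empty substring of some input string.
    candidates = {
        s[i:j]
        for s in strings
        for i in range(len(s))
        for j in range(i + 1, len(s) + 1)
    }
    for w in candidates:
        # Number of strings in which w meets that string's own threshold.
        satisfied = sum(
            1
            for t, x in zip(strings, thresholds)
            if count_occurrences(t, w) >= x
        )
        for k in range(1, satisfied + 1):
            best[k] = max(best[k], len(w))
    return best[1:]
```

With all thresholds equal to 1 the function answers the $k$-common substring problem. For example, `k_common_repeated_lengths(["abab", "baba", "abc"], [1, 1, 1])` gives `[4, 3, 2]`: "abab" occurs in one string, "aba" in two, and "ab" in all three.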
