A $2\frac{2}{3}$ superstring approximation algorithm

Chris Armen, Clifford Stein

doi:10.1016/s0166-218x(98)00065-1

Open Access

Abstract

Given a collection of strings $S = \{s_1, \ldots, s_n\}$ over an alphabet $\Sigma$, a superstring $\alpha$ of $S$ is a string containing each $s_i$ as a substring; that is, for each $i$, $1 \le i \le n$, $\alpha$ contains a block of $|s_i|$ consecutive characters that match $s_i$ exactly. The shortest superstring problem is the problem of finding a superstring $\alpha$ of minimum length. It has applications in both data compression and computational biology. Blum et al. (1994) showed it to be MAX SNP-hard. The first $O(1)$-approximation algorithm, also due to Blum et al. (1994), returns a superstring no more than 3 times the length of an optimal solution. Prior to the algorithm described in this paper, several published results improved on this approximation ratio; of these, the best was our algorithm ShortString, a $2\frac{3}{4}$-approximation (Armen and Stein, 1995). We present our new algorithm, G-ShortString, which achieves an approximation ratio of $2\frac{2}{3}$. Our approach builds on the work in Armen and Stein (1995), in which we identified classes of strings that have a nested periodic structure and that must be present in the worst case for our algorithms. There we introduced machinery to describe these strings and proved strong structural properties about them. In this paper we extend this study to strings that exhibit a more relaxed form of the same structure, and we use this understanding to obtain our improved result.
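
To make the definitions above concrete, the following Python sketch checks the superstring property and finds a shortest superstring of a tiny instance by exhaustive search. It is not G-ShortString or ShortString, neither of which is specified in this abstract; the function names `is_superstring`, `overlap`, and `shortest_superstring_bruteforce` are illustrative assumptions, and the search is exponential in the number of strings, so it is only suitable for small examples.

```python
from itertools import permutations


def is_superstring(alpha, strings):
    """Return True if every string in `strings` occurs as a substring of alpha."""
    return all(s in alpha for s in strings)


def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def shortest_superstring_bruteforce(strings):
    """Exhaustive search (hypothetical helper, exponential time).

    Strings contained in other strings are dropped first; for the
    remaining substring-free set, some ordering merged with maximal
    pairwise overlaps yields a shortest superstring.
    """
    uniq = list(dict.fromkeys(strings))  # remove duplicates, keep order
    reduced = [s for s in uniq if not any(s != t and s in t for t in uniq)]
    if not reduced:
        return ""
    best = None
    for order in permutations(reduced):
        candidate = order[0]
        for prev, cur in zip(order, order[1:]):
            candidate += cur[overlap(prev, cur):]
        if best is None or len(candidate) < len(best):
            best = candidate
    return best


if __name__ == "__main__":
    S = ["abc", "bcd", "cde"]
    opt = shortest_superstring_bruteforce(S)
    naive = "".join(S)  # plain concatenation is always a superstring
    print(opt, len(opt))
    # An r-approximation must return a superstring of length at most r * len(opt).
    print(len(naive) / len(opt) <= 8 / 3)
```

Here the plain concatenation serves only as a stand-in for the output of an approximation algorithm: a guarantee of $2\frac{2}{3}$ means the returned superstring is never longer than $\frac{8}{3}$ times the optimal length on any input.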
