Abstract

We consider the following problem: given a collection of strings s_1, …, s_m, find the shortest string s such that each s_i appears as a substring (a consecutive block) of s. Although this problem is known to be NP-hard, a simple greedy procedure appears to do quite well and is routinely used in DNA sequencing and data compression practice: repeatedly merge the pair of (distinct) strings with maximum overlap until only one string remains. Let n denote the length of the optimal superstring. A common conjecture states that this greedy procedure produces a superstring of length O(n) (in fact, 2n), yet the only previous nontrivial bound known for any polynomial-time algorithm is a recent O(n log n) result. We show that the greedy algorithm does in fact achieve a constant-factor approximation, proving an upper bound of 4n. Furthermore, we present a simple modified version of the greedy algorithm that we show produces a superstring of length at most 3n. We also show the superstring problem to be MAXSNP-hard, which implies that a polynomial-time approximation scheme for this problem is unlikely.
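The greedy procedure described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the function names `overlap` and `greedy_superstring` are hypothetical, ties between equally good pairs are broken arbitrarily (first pair found), and strings contained in another input string are discarded up front, an assumption that keeps every merge well defined. The modified 3n variant is not shown.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest proper suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings):
    """Repeatedly merge the pair of distinct strings with maximum overlap."""
    # Assumption: remove duplicates and strings that occur inside another
    # string; any superstring of the remaining strings covers them too.
    items = list(dict.fromkeys(strings))
    items = [s for s in items if not any(s != t and s in t for t in items)]

    while len(items) > 1:
        best_k, best_i, best_j = -1, 0, 1
        # Examine every ordered pair (a, b) of distinct strings.
        for i, a in enumerate(items):
            for j, b in enumerate(items):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # Merge the best pair, writing the overlapping part only once.
        merged = items[best_i] + items[best_j][best_k:]
        items = [s for idx, s in enumerate(items) if idx not in (best_i, best_j)]
        items.append(merged)

    return items[0] if items else ""


print(greedy_superstring(["abc", "bcd", "cde"]))  # prints "abcde"
```

The pairwise search is quadratic in the number of strings at every round, which is fine for illustration; it makes no attempt at the efficiency a practical implementation would need.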
