Abstract

Merging words according to their overlap yields a superstring. This basic operation makes it possible to infer long strings from a collection of short pieces, as in genome assembly. To capture a maximum of overlaps, the goal is to infer the shortest superstring of a set of input words. The Shortest Cyclic Cover of Strings (SCCS) problem asks, instead of a single linear superstring, for a set of cyclic strings that contain the words as substrings and whose sum of lengths is minimal. SCCS is used as a crucial step in polynomial time approximation algorithms for the notably hard Shortest Superstring problem, but it is currently solved in cubic time. The cyclic strings are then cut and merged to build a linear superstring. SCCS can also be solved by a greedy algorithm. Here, we propose a linear time algorithm for solving SCCS based on an Eulerian graph that captures all greedy solutions in linear space. Because the graph is Eulerian, this algorithm can also find a greedy solution of SCCS with the least number of cyclic strings. This has implications for solving certain instances of the Shortest linear or cyclic Superstring problems.
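To make the cyclic cover idea concrete, the following Python fragment is a naive, quadratic sketch of a greedy selection of overlaps; it is not the linear time algorithm proposed here. It assumes the input words are distinct and that no word is a substring of another, and the names max_overlap and greedy_cyclic_cover are illustrative only.

```python
def max_overlap(u: str, v: str) -> int:
    """Longest suffix of u that is a prefix of v, shorter than both words."""
    for k in range(min(len(u), len(v)) - 1, 0, -1):
        if u.endswith(v[:k]):
            return k
    return 0


def greedy_cyclic_cover(words: list[str]) -> list[str]:
    """Pick overlaps greedily (self-overlaps included) and read off cycles.

    Assumption: words are distinct and none is a substring of another.
    """
    n = len(words)
    pairs = [(max_overlap(words[i], words[j]), i, j)
             for i in range(n) for j in range(n)]
    pairs.sort(key=lambda t: t[0], reverse=True)  # largest overlaps first

    succ = [None] * n      # succ[i] = word that follows word i on its cycle
    has_pred = [False] * n
    for _, i, j in pairs:
        if succ[i] is None and not has_pred[j]:
            succ[i] = j
            has_pred[j] = True

    # Every word now has exactly one successor and one predecessor,
    # so following succ decomposes the words into cycles.
    seen = [False] * n
    cover = []
    for start in range(n):
        if seen[start]:
            continue
        pieces, i = [], start
        while not seen[i]:
            seen[i] = True
            j = succ[i]
            # keep the part of words[i] that is not overlapped by words[j]
            pieces.append(words[i][:len(words[i]) - max_overlap(words[i], words[j])])
            i = j
        cover.append("".join(pieces))
    return cover


cover = greedy_cyclic_cover(["abc", "bca", "cab", "xyx"])
```

In this small instance the three rotations of abc end up on one cyclic string, while xyx overlaps only with itself and forms its own cyclic string xy, read cyclically.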

Highlights

  • The possibility of merging two words into a longer one according to their overlap (for instance, merging abcde with cdeba into abcdeba) makes it possible to infer the sequence of a target molecule from the short reads produced by sequencing machines

  • Merging is heavily used in any genome assembler and, given the redundancy of the sequencing, which yields a high density of reads, the objective is to compute a shortest superstring of a set of input words [10]

  • Blum et al. [1] were the first to exhibit such an algorithm and to prove that it achieves a constant ratio of 3 for the Shortest (linear) Superstring problem (SLiS). This algorithm first solves Shortest Cyclic Cover of Strings (SCCS) and, in a second step, combines the cyclic strings by running a greedy algorithm on them
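The merge operation mentioned in the highlights can be illustrated with a minimal Python sketch. The function names max_overlap and merge are hypothetical and chosen for illustration; the overlap is computed naively rather than with an efficient string index.

```python
def max_overlap(u: str, v: str) -> int:
    """Longest suffix of u that is a prefix of v, shorter than both words."""
    for k in range(min(len(u), len(v)) - 1, 0, -1):
        if u.endswith(v[:k]):
            return k
    return 0


def merge(u: str, v: str) -> str:
    """Glue v after u, writing their maximal overlap only once."""
    return u + v[max_overlap(u, v):]


merged = merge("abcde", "cdeba")  # the overlap is "cde"
```

Here the two words share the overlap cde, so the merge yields abcdeba, the example given above.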

Summary

Introduction

The possibility of merging two words into a longer one according to their overlap (for instance, merging abcde with cdeba into abcdeba) makes it possible to infer the sequence of a target molecule from the short reads produced by sequencing machines. The research in the last 25 years has mainly focused on polynomial time approximation algorithms for SLiS (see [8] for a recent list). Despite these efforts, the conjecture that the algorithm greedy (Algorithm 1), which iteratively updates the input set by merging two maximally overlapping strings until one string is left, achieves an approximation ratio of 2 remains open. A shortest linear superstring is associated with a Hamiltonian path on the overlap graph of the input words, and a shortest cyclic superstring with a Hamiltonian cycle [1]. Those problems are hard to approximate, and most approximation algorithms for SLiS resort to a relaxed question, namely Shortest Cyclic Cover of Strings (SCCS). We show that our algorithm gives, for a subset of instances of the shortest linear and cyclic superstring problems, interesting approximations or even exact solutions (Section 4).
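The greedy procedure described above (repeatedly merging two maximally overlapping strings until one string is left) can be sketched in Python as follows. This is a naive illustration and not a transcription of Algorithm 1: the names are hypothetical, ties between equal overlaps are broken by enumeration order, and words contained in another word are dropped beforehand as a simplifying assumption.

```python
from itertools import permutations


def max_overlap(u: str, v: str) -> int:
    """Longest suffix of u that is a prefix of v, shorter than both words."""
    for k in range(min(len(u), len(v)) - 1, 0, -1):
        if u.endswith(v[:k]):
            return k
    return 0


def greedy_superstring(words: list[str]) -> str:
    """Merge the two most overlapping strings until a single string is left."""
    # Assumption: duplicates and words contained in another word are dropped.
    uniq = list(dict.fromkeys(words))
    pool = [w for w in uniq if not any(w != x and w in x for x in uniq)]
    while len(pool) > 1:
        best = (-1, 0, 1)  # (overlap length, index of left word, index of right word)
        for i, j in permutations(range(len(pool)), 2):
            k = max_overlap(pool[i], pool[j])
            if k > best[0]:
                best = (k, i, j)
        k, i, j = best
        merged = pool[i] + pool[j][k:]
        pool = [w for idx, w in enumerate(pool) if idx not in (i, j)] + [merged]
    return pool[0] if pool else ""


result = greedy_superstring(["abc", "bcd", "cde"])
```

On the words abc, bcd and cde, the sketch first merges abc with bcd into abcd, then abcd with cde, which gives the superstring abcde.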

Notation and problem definition
Related works
Explanations about Algorithm greedy
Overlaps
Merge and Red-Blue Paths
Equivalence of greedy solutions
Characterisation of the Superstring Graph and a construction algorithm
SCCS with a cardinality constraint on greedy solutions
Consequences of the topology of the Superstring Graph
Conclusion