DNA sequencing and string learning

Tao Jiang,Ming Li

doi:10.1007/bf01192694

Abstract

In laboratories the majority of large-scale DNA sequencing is done following theshotgun strategy, which is to sequence large amount of relatively short fragments randomly and then heuristically find a shortest common superstring of the fragments [26]. We study mathematical frameworks, under plausible assumptions, suitable for massive automated DNA sequencing and for analyzing DNA sequencing algorithms. We model the DNA sequencing problem as learning a string from its randomly drawn substrings. Under certain restrictions, this may be viewed as string learning in Valiant's distribution-free learning model and in this case we give an efficient learning algorithm and a quantitative bound on how many examples suffice. One major obstacle to our approach turns out to be a quite well-known open question on how to approximate a shortest common superstring of a set of strings, raised by a number of authors in the last 10 years [9], [29], [30]. We give the firstprovably good algorithm which approximates a shortest superstring of lengthn by a superstring of lengthO(n logn). The algorithm works equally well even in the presence of negative examples, i.e., when merging of some strings is prohibited.

Full Text