Abstract

The nature of the alignment with gaps corresponding to a longest common subsequence (LCS) of two independent iid random sequences drawn from a finite alphabet is investigated. It is shown that such an optimal alignment typically matches pieces of similar short length. This is of importance in understanding the structure of optimal alignments of two sequences. Moreover, it is shown that any property common to two subsequences typically holds in most parts of the optimal alignment whenever this same property holds, with high probability, for strings of similar short length. Our results should, in particular, prove useful for simulations, since they imply that the re-scaled two-dimensional representation of a LCS gets uniformly close to the diagonal as the length of the sequences grows without bound.

Highlights

  • Let x and y be two finite strings

  • The nature of the alignment with gaps corresponding to a longest common subsequence (LCS) of two independent iid random sequences drawn from a finite alphabet is investigated

  • An alignment corresponding to a LCS is called an optimal alignment (OA) or is said to be optimal


Summary

Introduction

Let x and y be two finite strings. A common subsequence of x and y is a subsequence of both x and y, while a longest common subsequence (LCS) of x and y is a common subsequence of maximal length. There are no macroscopic gaps in any optimal alignment, and any such alignment must remain close to the main diagonal. This closeness-to-the-diagonal property has proved crucial in obtaining the first result on the limiting law of LCn, under a lower bound on the order of the variance; see [9]. Let us deal with our first goal and show that, with high probability, the optimal alignments satisfying (2.2) are such that most of their lengths ri − ri−1 are close to k. In the non-uniform case, such bounds might be far from optimal and require knowledge of the probability associated with each letter. To further address this question, let us present a lemma, with a somewhat easier approach, to deal with the generic case. Broadly, Theorem 2.2 asserts that for any ε > 0, there exists k large enough, but fixed, such that if X is divided into segments of length k, then typically (for at least a fraction 1 − ε of the segments), and with high probability, the LCSs match these segments to segments of similar length in Y.
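The two statements above (closeness to the diagonal, and segments of length k being matched to pieces of similar length) can be explored numerically. What follows is only a minimal simulation sketch, not the paper's method: it uses the standard dynamic-programming LCS with backtracking to extract one optimal alignment. The function names lcs_alignment, max_diagonal_deviation and matched_span_lengths are hypothetical, and the "matched span" computed here is an assumed, rough proxy for the lengths ri − ri−1, not their exact definition in (2.2).

```python
import random


def lcs_alignment(x, y):
    """Return one optimal alignment as a list of matched index pairs (i, j)."""
    n, m = len(x), len(y)
    # dp[i][j] is the LCS length of the prefixes x[:i] and y[:j].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # Backtrack from the bottom-right corner to recover matched pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs


def max_diagonal_deviation(pairs, n):
    """Largest |i - j| / n over matched pairs (re-scaled distance to the diagonal)."""
    if not pairs:
        return 0.0
    return max(abs(i - j) for i, j in pairs) / n


def matched_span_lengths(pairs, n, k):
    """For each length-k segment of x, the span of y-indices matched to it.

    Assumption: this span is only a proxy for the piece lengths in the text.
    A segment with no matched letter gets span 0.
    """
    spans = []
    for start in range(0, n - k + 1, k):
        js = [j for i, j in pairs if start <= i < start + k]
        spans.append(js[-1] - js[0] + 1 if js else 0)
    return spans


if __name__ == "__main__":
    # Illustrative parameters only (binary alphabet, n = 400, k = 20).
    rng = random.Random(0)
    n, k = 400, 20
    x = [rng.choice("ab") for _ in range(n)]
    y = [rng.choice("ab") for _ in range(n)]
    pairs = lcs_alignment(x, y)
    print("LCS length:", len(pairs))
    print("max re-scaled deviation:", max_diagonal_deviation(pairs, n))
    print("matched spans per segment:", matched_span_lengths(pairs, n, k))
```

Since an LCS is in general not unique, the backtracking step returns just one of possibly many optimal alignments; the results summarized above concern all of them, so a single run is only indicative. Repeating the experiment for growing n gives a rough picture of how the maximal re-scaled deviation behaves.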

Proof of the main theorem
Closeness to the diagonal
Short string-lengths properties are generic
