Abstract
Identifying long pairwise maximal common substrings among a large set of sequences is a frequently used construct in computational biology, with applications in DNA sequence clustering and assembly. Due to errors made by sequencers, algorithms that can accommodate a small number of differences are of particular interest, but obtaining provably efficient solutions for such problems has been elusive. In this paper, we present a provably efficient algorithm with an expected run time guarantee of \(O(N\log^k N+\mathsf{occ})\), where \(\mathsf{occ}\) is the output size, for the following problem: given a collection \({\mathcal D}=\{S_1,S_2,\dots,S_n\}\) of \(n\) sequences of total length \(N\), a length threshold \(\phi\), and a mismatch threshold \(k \ge 0\), report all \(k\)-mismatch maximal common substrings of length at least \(\phi\) over all pairs of sequences in \({\mathcal D}\). In addition, we present a result showing the hardness of this problem.
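To make the problem statement concrete, the following is a minimal brute-force sketch in Python. It is not the algorithm of the paper and does not achieve the stated run time; it only enumerates the output on small inputs. It assumes that a pair of equal-length substrings is maximal when its Hamming distance is at most \(k\) and extending it by one character on either side would either leave a sequence or push the distance above \(k\). The function name and the output tuple layout are illustrative choices, not taken from the paper.

```python
from itertools import combinations


def k_mismatch_maximal_common_substrings(seqs, phi, k):
    """Brute-force baseline (assumed definition of maximality, see lead-in).

    Returns a set of tuples (i, a, j, b, length): the substrings
    seqs[i][a:a+length] and seqs[j][b:b+length] differ in at most k
    positions, have length >= phi, and cannot be extended left or right.
    All indices are 0-based.
    """
    results = set()
    for (i, s), (j, t) in combinations(enumerate(seqs), 2):
        # Each diagonal fixes the offset d = b - a between aligned positions.
        for d in range(-(len(s) - 1), len(t)):
            a0, b0 = max(0, -d), max(0, d)
            n = min(len(s) - a0, len(t) - b0)
            # mism[p] is 1 where the aligned characters differ, else 0.
            mism = [int(s[a0 + p] != t[b0 + p]) for p in range(n)]
            for start in range(n):
                # Greedily extend to the right while staying within k mismatches.
                count, end = 0, start
                while end < n and count + mism[end] <= k:
                    count += mism[end]
                    end += 1
                if end == start:
                    continue  # empty window (k = 0 and a mismatch at start)
                # Left-maximal: at the boundary, or one step left exceeds k.
                left_blocked = start == 0 or count + mism[start - 1] > k
                if left_blocked and end - start >= phi:
                    results.add((i, a0 + start, j, b0 + start, end - start))
    return results


if __name__ == "__main__":
    for hit in sorted(k_mismatch_maximal_common_substrings(["abcde", "xbcdy"], 3, 1)):
        print(hit)
```

This baseline takes time roughly cubic in the sequence lengths per pair, so it is useful only as a reference for checking a faster implementation on small inputs.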