Abstract

Identifying long pairwise maximal common substrings among a large set of sequences is a frequently used construct in computational biology, with applications in DNA sequence clustering and assembly. Due to errors made by sequencers, algorithms that can accommodate a small number of differences are of particular interest, but obtaining provably efficient solutions for such problems has been elusive. In this paper, we present a provably efficient algorithm with an expected run time guarantee of \(O(N\log ^k N+\mathsf {occ})\), where \(\mathsf {occ}\) is the output size, for the following problem: given a collection \({\mathcal D}=\{S_1,S_2,\dots , S_n\}\) of \(n\) sequences of total length \(N\), a length threshold \(\phi \) and a mismatch threshold \(k \ge 0\), report all \(k\)-mismatch maximal common substrings of length at least \(\phi \) over all pairs of sequences in \({\mathcal D}\). In addition, we present a result showing the hardness of this problem.
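
To make the problem statement concrete, the following is a minimal brute-force sketch in Python. It is not the algorithm of the paper and comes nowhere near the stated \(O(N\log ^k N+\mathsf {occ})\) bound; it only illustrates what is being reported. The function name, the output tuple layout, and the reading of "maximal" (an aligned pair of equal-length substrings with at most \(k\) mismatches that cannot be extended on either side without leaving a sequence or exceeding \(k\) mismatches) are assumptions made for this illustration.

```python
from itertools import combinations


def kmismatch_maximal_common_substrings(seqs, phi, k):
    """Naive baseline (hypothetical helper, not the paper's method).

    Returns a list of tuples (i, a, j, b, length): the substrings
    seqs[i][a:a+length] and seqs[j][b:b+length] differ in at most k
    positions and cannot be extended left or right without leaving a
    sequence or exceeding k mismatches. Only length >= phi is reported.
    """
    results = []
    for (i, s), (j, t) in combinations(enumerate(seqs), 2):
        # Each diagonal d = b - a fixes one gap-free alignment of s and t.
        for d in range(-(len(s) - 1), len(t)):
            a0 = max(0, -d)          # first aligned position in s
            b0 = a0 + d              # first aligned position in t
            diag_len = min(len(s) - a0, len(t) - b0)
            if diag_len < max(phi, 1):
                continue
            # Mismatch offsets on this diagonal, with sentinels at both ends.
            mism = [-1]
            mism += [p for p in range(diag_len) if s[a0 + p] != t[b0 + p]]
            mism.append(diag_len)
            n_mism = len(mism) - 2
            if n_mism <= k:
                # The whole diagonal fits within the mismatch budget.
                windows = [(0, diag_len)]
            else:
                # Each window spans exactly k consecutive mismatches and is
                # bounded by a mismatch or a sequence end on either side.
                windows = [
                    (mism[r] + 1, mism[r + k + 1])
                    for r in range(n_mism - k + 1)
                ]
            for start, end in windows:   # half-open interval [start, end)
                length = end - start
                if length > 0 and length >= phi:
                    results.append((i, a0 + start, j, b0 + start, length))
    return results


if __name__ == "__main__":
    reads = ["ACGTACGT", "ACGAACGT"]
    for hit in kmismatch_maximal_common_substrings(reads, phi=4, k=1):
        print(hit)
```

The sketch enumerates every alignment diagonal of every pair of sequences, so its cost grows with the product of the two sequence lengths for each pair. That quadratic behaviour is the kind of cost a provably efficient algorithm is meant to avoid.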
