An Efficient Algorithm for Finding All Pairs k-Mismatch Maximal Common Substrings

Sharma V Thankachan, Sriram P Chockalingam, Srinivas Aluru

doi:10.1007/978-3-319-38782-6_1


Abstract

Identifying long pairwise maximal common substrings among a large set of sequences is a frequently used construct in computational biology, with applications in DNA sequence clustering and assembly. Due to errors made by sequencers, algorithms that can accommodate a small number of differences are of particular interest, but obtaining provably efficient solutions for such problems has been elusive. In this paper, we present a provably efficient algorithm with an expected run time guarantee of \(O(N \log^k N + \mathsf{occ})\), where \(\mathsf{occ}\) is the output size, for the following problem: Given a collection \(\mathcal{D} = \{S_1, S_2, \dots, S_n\}\) of \(n\) sequences of total length \(N\), a length threshold \(\phi\) and a mismatch threshold \(k \ge 0\), report all \(k\)-mismatch maximal common substrings of length at least \(\phi\) over all pairs of sequences in \(\mathcal{D}\). In addition, we present a result showing the hardness of this problem.
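
To make the problem statement concrete, here is a naive brute-force sketch in Python. It is not the algorithm of the paper and does not achieve the stated expected run time: it simply scans every alignment diagonal of every pair of sequences, which is quadratic per pair. It also assumes one common reading of the definition, which this abstract does not spell out: a \(k\)-mismatch maximal common substring is taken to be a pair of equal-length aligned substrings with Hamming distance at most \(k\) that cannot be extended by one character on either side without exceeding \(k\) mismatches or running off a sequence end. The function name and the output tuple layout are hypothetical.

```python
from itertools import combinations


def k_mismatch_maximal_pairs(seqs, k, phi):
    """Brute-force illustration only (not the paper's algorithm).

    Returns tuples (i, a, j, b, length) meaning that seqs[i][a:a+length]
    and seqs[j][b:b+length] differ in at most k positions and, under the
    assumed definition, cannot be extended left or right.
    """
    out = []
    for (i, s), (j, t) in combinations(enumerate(seqs), 2):
        # Diagonal d aligns position a of s with position a - d of t.
        for d in range(-(len(t) - 1), len(s)):
            a0 = max(0, d)            # first aligned position in s
            b0 = a0 - d               # first aligned position in t
            length = min(len(s) - a0, len(t) - b0)
            # Offsets along this diagonal where the two sequences differ.
            mism = [x for x in range(length) if s[a0 + x] != t[b0 + x]]
            if len(mism) <= k:
                # The whole diagonal fits within the mismatch budget.
                windows = [(0, length)]
            else:
                # A maximal window lies strictly between two mismatches
                # (or a sequence end) that enclose exactly k mismatches.
                bounds = [-1] + mism + [length]
                windows = [
                    (bounds[u] + 1, bounds[u + k + 1])
                    for u in range(len(mism) - k + 1)
                ]
            for lo, hi in windows:
                if hi - lo >= phi:
                    out.append((i, a0 + lo, j, b0 + lo, hi - lo))
    return out
```

For example, with the two sequences `"ACGTACGT"` and `"ACGAACGT"`, `k = 1` and `phi = 8`, the only reported pair is the full-length alignment starting at position 0 in both sequences, since the single mismatch at offset 3 is within the budget.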
