Improved algorithms for finding edit distance based motifs

Soumitra Pal, Sanguthevar Rajasekaran

doi:10.1109/bibm.2015.7359740

Abstract

Motif search is an important step in extracting meaningful patterns from biological data. The general problem of motif search is intractable. There is a pressing need to develop efficient exact and approximation algorithms to solve this problem. In this paper we present novel algorithms for solving the (l, d) Edit-distance-based Motif Search (EMS) problem: given two integers l, d and n biological strings, find all strings of length l that appear in each input string with at most d substitutions, insertions and deletions. The algorithms for EMS are customarily evaluated on several challenging instances such as (9, 2), (11, 3), (13, 4), (15, 5), and so on. The best previously known algorithm, EMS1, solves up to instance (11, 3) in an estimated 3 days. Our algorithm is more than 20 times faster than EMS1. For example, our algorithm solves instance (11, 3) in a couple of minutes and instance (14, 3) in a couple of hours. This significant improvement is due to a novel and provably efficient neighborhood generation technique introduced in this paper. Firstly, we show that it is enough to consider the neighbors which are at a distance of exactly d from all possible substrings of the input strings. Secondly, we compactly represent the candidate motifs in the neighborhood using wildcard characters. Thirdly, we generate these compact candidate motifs nearly uniquely, with very few repetitions. Finally, we use a trie-based data structure to efficiently store the candidate motifs and to output the final motifs in sorted order. We believe that the techniques we introduce in this paper are also applicable to other motif search problems such as PMS.
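To make the (l, d) EMS problem statement concrete, the following is a minimal brute-force sketch in Python. It is not the algorithm of this paper: it uses none of the exact-distance-d, wildcard, or trie techniques, and simply enumerates the full edit-distance neighborhood of every substring. The function names (`edit_neighbors`, `neighborhood`, `ems_naive`) and the default DNA alphabet are assumptions made for illustration, and the approach is only practical for very small l and d.

```python
def edit_neighbors(s, alphabet):
    """Return every string reachable from s by one insertion, deletion,
    or substitution (s itself may be included via a no-op substitution)."""
    out = set()
    for i in range(len(s) + 1):
        for c in alphabet:
            out.add(s[:i] + c + s[i:])            # insert c before position i
        if i < len(s):
            out.add(s[:i] + s[i + 1:])            # delete the character at i
            for c in alphabet:
                out.add(s[:i] + c + s[i + 1:])    # substitute the character at i
    return out


def neighborhood(s, d, alphabet):
    """Return all strings within edit distance at most d of s,
    found by a breadth-first expansion of d rounds of single edits."""
    seen = {s}
    frontier = {s}
    for _ in range(d):
        frontier = {t for u in frontier for t in edit_neighbors(u, alphabet)} - seen
        seen |= frontier
    return seen


def ems_naive(strings, l, d, alphabet="ACGT"):
    """Return, in sorted order, all length-l strings that occur in every
    input string within edit distance at most d (hypothetical helper)."""
    motifs = None
    for s in strings:
        candidates = set()
        # A length-l motif can only match substrings of length l-d .. l+d.
        for k in range(max(l - d, 0), l + d + 1):
            for i in range(len(s) - k + 1):
                for t in neighborhood(s[i:i + k], d, alphabet):
                    if len(t) == l:
                        candidates.add(t)
        # Keep only candidates that also occur in all earlier strings.
        motifs = candidates if motifs is None else motifs & candidates
    return sorted(motifs or [])


print(ems_naive(["ACGT", "ACGA"], 3, 0))  # → ['ACG']
```

With d = 0 the sketch reduces to finding common substrings of length l, as in the call above. For d > 0 the neighborhood grows rapidly and the same candidate is generated many times, which is the redundancy that the compact, nearly unique generation described in the abstract is meant to avoid.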

Full Text