Abstract
Background: Motif search is an important step in extracting meaningful patterns from biological data. The general problem of motif search is intractable, and there is a pressing need to develop efficient exact and approximation algorithms to solve it. In this paper, we present several novel, exact, sequential and parallel algorithms for solving the (l,d) Edit-distance-based Motif Search (EMS) problem: given two integers l, d and n biological strings, find all strings of length l that appear in each input string with at most d errors of types substitution, insertion and deletion.
Methods: One popular technique to solve the problem is to explore, for each input string, the set of all possible l-mers that belong to the d-neighborhood of any substring of the input string, and to output those that are common to all input strings. We introduce a novel and provably efficient neighborhood exploration technique. We show that it is enough to consider the candidates in the neighborhood that are at a distance of exactly d. We compactly represent these candidate motifs using wildcard characters and efficiently explore them with very few repetitions. Our sequential algorithm uses a trie-based data structure to efficiently store and sort the candidate motifs. Our parallel algorithm, in a multi-core shared-memory setting, uses arrays for storing the candidate motifs and a novel modification of radix sort for sorting them.
Results: The algorithms for EMS are customarily evaluated on several challenging instances such as (8,1), (12,2), (16,3), (20,4), and so on. The best previously known algorithm, EMS1, is sequential and solves instances up to (16,3) in an estimated 3 days. Our sequential algorithms are more than 20 times faster on (16,3). On other hard instances such as (9,2), (11,3), (13,4), our algorithms are much faster. Our parallel algorithm has more than 600% scaling performance while using 16 threads.
Conclusions: Our algorithms have pushed up the state of the art of EMS solvers, and we believe that the techniques introduced in this paper are also applicable to other motif search problems such as Planted Motif Search (PMS) and Simple Motif Search (SMS).
Electronic supplementary material: The online version of this article (doi:10.1186/s12864-016-2789-9) contains supplementary material, which is available to authorized users.
Highlights
Motif search is an important step in extracting meaningful patterns from biological data
Given n strings S(1), S(2), ..., S(n), each of length m from a fixed alphabet, and integers l, d, the Edit-distance-based Motif Search (EMS) problem is to find all patterns M of length l that occur in at least one position in each S(i) with an edit distance of at most d
Our approach uses the following five ideas, which can be applied to other motif search problems as well. Efficient neighborhood generation: we show that it is enough to consider the neighbors that are at a distance of exactly d from all possible substrings of the input strings
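The neighborhood-based formulation can be illustrated with a small Python sketch. This is not the paper's exploration technique: it naively generates every string within edit distance d of each substring (not only those at distance exactly d, and without wildcard compaction) and then intersects the candidate sets across the input strings. The function names and the default alphabet "ACGT" are assumptions made for illustration, and the approach is only practical for very small l and d.

```python
def edit_neighbors(s, alphabet):
    """All strings reachable from s by one substitution, insertion, or deletion."""
    out = set()
    for i in range(len(s) + 1):
        for c in alphabet:
            out.add(s[:i] + c + s[i:])              # insert c before position i
        if i < len(s):
            out.add(s[:i] + s[i + 1:])              # delete s[i]
            for c in alphabet:
                if c != s[i]:
                    out.add(s[:i] + c + s[i + 1:])  # substitute s[i] with c
    return out


def neighborhood(s, d, alphabet):
    """All strings within edit distance d of s, found breadth-first."""
    seen = {s}
    frontier = {s}
    for _ in range(d):
        frontier = {t for u in frontier for t in edit_neighbors(u, alphabet)} - seen
        seen |= frontier
    return seen


def candidates_for_string(text, l, d, alphabet):
    """All l-mers within edit distance d of some substring of text."""
    found = set()
    # A substring matching an l-mer with at most d edits has length in [l-d, l+d].
    for length in range(max(l - d, 0), l + d + 1):
        for start in range(len(text) - length + 1):
            sub = text[start:start + length]
            found |= {t for t in neighborhood(sub, d, alphabet) if len(t) == l}
    return found


def ems_by_neighborhood(strings, l, d, alphabet="ACGT"):
    """Naive (l, d) EMS: intersect the candidate sets of all input strings."""
    common = None
    for text in strings:
        cands = candidates_for_string(text, l, d, alphabet)
        common = cands if common is None else common & cands
    return common if common is not None else set()
```

For example, `ems_by_neighborhood(["ACGT", "ACCT"], 4, 1)` returns the set of 4-mers that lie within one edit of a substring of both strings. The sketch revisits many strings repeatedly, which is the kind of redundancy the neighborhood generation idea above is meant to reduce.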
Summary
Motif search is an important step in extracting meaningful patterns from biological data. We present several novel, exact, sequential and parallel algorithms for solving the (l, d) Edit-distance-based Motif Search (EMS) problem: given two integers l, d and n biological strings, find all strings of length l that appear in each input string with at most d errors of types substitution, insertion and deletion. In the Planted Motif Search (PMS) problem, given two integers l, d and n biological strings, the problem is to find all strings of length l that appear in each of the n input strings with at most d mismatches. Given n strings S(1), S(2), ..., S(n), each of length m from a fixed alphabet, and integers l, d, the EMS problem is to find all patterns M of length l that occur in at least one position in each S(i) with an edit distance of at most d. ED(X, Y) stands for the edit distance between two strings X and Y.
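As a concrete reading of ED(X, Y), the following Python sketch computes edit distance with the standard dynamic-programming recurrence and uses it to test whether a pattern occurs in a string with at most d edits. The function names `edit_distance` and `occurs_within` are hypothetical, and this is a textbook baseline rather than any algorithm from the paper.

```python
def edit_distance(x, y):
    """ED(x, y): minimum number of substitutions, insertions, and deletions."""
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, cx in enumerate(x, start=1):
        curr = [i]                  # distance from x[:i] to the empty prefix of y
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cx
                curr[j - 1] + 1,           # insert cy
                prev[j - 1] + (cx != cy),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def occurs_within(motif, text, d):
    """True if some substring of text is within edit distance d of motif."""
    l = len(motif)
    # Only substrings whose length is in [l-d, l+d] can be within distance d.
    for length in range(max(l - d, 0), l + d + 1):
        for start in range(len(text) - length + 1):
            if edit_distance(motif, text[start:start + length]) <= d:
                return True
    return False
```

Under these definitions, a pattern M is a valid (l, d) motif when `occurs_within(M, S, d)` holds for every input string S.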