Abstract

Background: Motif searching is an important step in the detection of rare events occurring in a set of DNA or protein sequences. One formulation of the problem is known as (l,d)-motif search or Planted Motif Search (PMS). In PMS we are given two integers l and d and n biological sequences. We want to find all sequences of length l that appear in each of the input sequences with at most d mismatches. The PMS problem is NP-complete. PMS algorithms are typically evaluated on certain instances considered challenging. Despite ample research in the area, a considerable performance gap exists because many state of the art algorithms have large runtimes even for moderately challenging instances.

Results: This paper presents a fast exact parallel PMS algorithm called PMS8. PMS8 is the first algorithm to solve the challenging (l,d) instances (25,10) and (26,11). PMS8 is also efficient on instances with larger l and d such as (50,21). We include a comparison of PMS8 with several state of the art algorithms on multiple problem instances. This paper also presents necessary and sufficient conditions for 3 l-mers to have a common d-neighbor. The program is freely available at http://engr.uconn.edu/~man09004/PMS8/.

Conclusions: We present PMS8, an efficient exact algorithm for Planted Motif Search. PMS8 introduces novel ideas for generating common neighborhoods. We have also implemented a parallel version for this algorithm. PMS8 can solve instances not solved by any previous algorithms.

Highlights

  • Motif searching is an important step in the detection of rare events occurring in a set of DNA or protein sequences

  • This paper presents an efficient exact parallel algorithm for the Planted Motif Search (PMS) problem, also known as the (l, d) motif problem [1]

  • PMS8 was evaluated on the Hornet cluster in the Booth Engineering Center for Advanced Technology (BECAT) at the University of Connecticut


Summary

Introduction

Motif searching is an important step in the detection of rare events occurring in a set of DNA or protein sequences. One formulation of the problem is known as (l, d)-motif search or Planted Motif Search (PMS). PMS algorithms are typically evaluated on certain instances considered challenging. This paper presents an efficient exact parallel algorithm for the PMS problem, also known as the (l, d) motif problem [1]. A string of length l is called an l-mer. Given n sequences S_1, S_2, ..., S_n of length m each, from an alphabet Σ, and two integers l and d, the task is to identify all l-mers M, M ∈ Σ^l, that occur in at least one location in each of the n sequences with a Hamming distance of at most d. Formally, M is a motif if and only if ∀i, 1 ≤ i ≤ n, ∃j_i, 1 ≤ j_i ≤ m − l + 1, such that Hd(M, S_i[j_i..j_i + l − 1]) ≤ d, where Hd denotes the Hamming distance.
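
To make the definition concrete, the motif condition can be checked directly in a few lines of Python. This is a minimal brute-force sketch for illustration only, not the PMS8 algorithm: it enumerates every candidate l-mer over the alphabet, which is feasible only for very small l. The function names (hamming_distance, is_motif, brute_force_pms) and the default DNA alphabet "ACGT" are assumptions made for this example and do not come from the paper.

```python
from itertools import product
from typing import List


def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def is_motif(candidate: str, sequences: List[str], d: int) -> bool:
    """Return True if every sequence contains at least one l-mer
    within Hamming distance d of the candidate (l = len(candidate))."""
    l = len(candidate)
    return all(
        any(
            hamming_distance(candidate, seq[j:j + l]) <= d
            for j in range(len(seq) - l + 1)
        )
        for seq in sequences
    )


def brute_force_pms(sequences: List[str], l: int, d: int,
                    alphabet: str = "ACGT") -> List[str]:
    """Test all len(alphabet)**l candidate l-mers against the definition.

    Illustrative only: the cost grows exponentially with l, which is
    why specialised algorithms are needed for challenging instances.
    """
    motifs = []
    for chars in product(alphabet, repeat=l):
        candidate = "".join(chars)
        if is_motif(candidate, sequences, d):
            motifs.append(candidate)
    return motifs


if __name__ == "__main__":
    # Hypothetical toy input; "ACG" is the only 3-mer shared exactly by all three.
    seqs = ["ACGTAC", "TTACGA", "GACGTT"]
    print(brute_force_pms(seqs, l=3, d=0))
```

With d = 0 the search reduces to finding l-mers that occur exactly in every sequence; raising d enlarges the set of accepted candidates, since any l-mer that qualifies for a smaller d also qualifies for a larger one.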

