Efficient algorithms for biological stems search.

Tian Mi, Sanguthevar Rajasekaran

doi:10.1186/1471-2105-14-161

Abstract

Background: Motifs are significant patterns in DNA, RNA, and protein sequences that play an important role in biological processes and functions, such as identification of open reading frames, RNA transcription, and protein binding. Several versions of the motif search problem have been studied in the literature. One such version is called Planted Motif Search (PMS), or (l, d)-motif search. PMS is known to be NP-complete. The time complexities of most planted motif search algorithms depend exponentially on the alphabet size. Recently, a new version of the motif search problem has been introduced by Kuksa and Pavlovic. We call this version the Motif Stems Search (MSS) problem. A motif stem is an l-mer (for some relevant value of l) with some wildcard characters and hence corresponds to a set of l-mers (without wildcards), some of which are (l, d)-motifs. Kuksa and Pavlovic have presented an efficient algorithm to find motif stems for inputs from large alphabets. Ideally, the number of stems output should be as small as possible, since the stems form a superset of the motifs.

Results: In this paper we propose an efficient algorithm for MSS and evaluate it on both synthetic and real data. This evaluation reveals that our algorithm is much faster than Kuksa and Pavlovic’s algorithm.

Conclusions: Our MSS algorithm outperforms the algorithm of Kuksa and Pavlovic in terms of run time as well as the number of stems output. Specifically, the stems output by our algorithm form a proper (and much smaller) subset of the stems output by Kuksa and Pavlovic’s algorithm.

Highlights

Motifs are significant patterns in DNA, RNA, and protein sequences that play an important role in biological processes and functions, such as identification of open reading frames, RNA transcription, and protein binding.
For motif search on large alphabets, we provide some definitions pertinent to the Planted Motif Search (PMS) and Motif Stems Search (MSS) problems.
We have evaluated our algorithms on the standard benchmark, where n = 20 and m = 600.

Summary

Introduction

Motifs are significant patterns in DNA, RNA, and protein sequences that play an important role in biological processes and functions, such as identification of open reading frames, RNA transcription, and protein binding. Several versions of the motif search problem have been studied in the literature. A new version of the motif search problem has been introduced by Kuksa and Pavlovic, who have presented an efficient algorithm to find motif stems for inputs from large alphabets.

The motif search problem has been studied extensively due to its pivotal biological significance. Several types of algorithms have been proposed for motif search. In one such class of methods, putative motifs in an input biological query sequence are predicted based on a database of known motifs. Examples include [1,2,3]. In another class of methods, motifs are assumed to appear frequently in a set of sequences, such as the same protein sequence from different species.
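To make the notions of an (l, d)-motif and a motif stem concrete, the following is a minimal brute-force sketch, not the algorithm proposed in this paper or that of Kuksa and Pavlovic. It assumes the usual PMS definition (an l-mer is an (l, d)-motif if every input sequence contains an l-mer within Hamming distance d of it). The wildcard symbol `*`, the function names, and the toy sequences are all hypothetical choices made for illustration.

```python
from itertools import product
from typing import List

WILDCARD = "*"  # hypothetical wildcard symbol, chosen for this sketch


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def is_ld_motif(candidate: str, sequences: List[str], d: int) -> bool:
    """True if every sequence has an l-mer within Hamming distance d of candidate."""
    l = len(candidate)
    return all(
        any(hamming(candidate, s[i:i + l]) <= d for i in range(len(s) - l + 1))
        for s in sequences
    )


def expand_stem(stem: str, alphabet: str) -> List[str]:
    """All wildcard-free l-mers represented by a stem."""
    # Each wildcard position may take any alphabet symbol; other positions are fixed.
    choices = [alphabet if c == WILDCARD else c for c in stem]
    return ["".join(p) for p in product(*choices)]


# Toy input (assumed data, not from the paper's benchmark).
sequences = ["ACGTAC", "TTGTAA", "CCGAAC"]
lmers = expand_stem("CG*A", "ACGT")                      # 4 l-mers covered by the stem
motifs = [m for m in lmers if is_ld_motif(m, sequences, d=1)]
print(lmers)   # ['CGAA', 'CGCA', 'CGGA', 'CGTA']
print(motifs)  # ['CGTA']
```

In this toy input, the stem covers four l-mers but only one of them is a (4, 1)-motif, which illustrates why stems form a superset of the motifs and why a smaller set of output stems is preferable.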

Methods

Results

Conclusion