Abstract

Discovering patterns in biological sequences is a crucial problem. For example, the identification of patterns in DNA sequences has resulted in the determination of open reading frames, identification of gene promoter elements, intron/exon splicing sites, and SH RNAs, location of RNA degradation signals, identification of alternative splicing sites, etc. In protein sequences, patterns have led to domain identification, location of protease cleavage sites, identification of signal peptides, protein interactions, determination of protein degradation elements, identification of protein trafficking elements, discovery of short functional motifs, etc. In this paper we focus on the identification of an important class of patterns, namely, motifs. We study the (ℓ, d) motif search problem or Planted Motif Search (PMS). PMS receives as input n strings and two integers ℓ and d. It returns all sequences M of length ℓ that occur in each input string, where each occurrence differs from M in at most d positions. Another formulation is quorum PMS (qPMS), where the motif appears in at least q% of the strings. We introduce qPMS9, a parallel exact qPMS algorithm that offers significant runtime improvements on DNA and protein datasets. qPMS9 solves the challenging DNA (ℓ, d)-instances (28, 12) and (30, 13). The source code is available at https://code.google.com/p/qpms9/.

Highlights

  • Discovering patterns in biological sequences is a crucial problem

  • We study the (ℓ, d) motif search problem, or Planted Motif Search (PMS)

  • PMS returns all sequences M of length ℓ that occur in each input string, where each occurrence differs from M in at most d positions


Summary

Introduction

Discovering patterns in biological sequences is a crucial problem. Planted Motif Search (PMS) returns all possible sequences M of length ℓ such that M occurs in each of the input strings and each occurrence differs from M in at most d positions. Buhler and Tompa have employed PMS algorithms to find known transcriptional regulatory elements upstream of several eukaryotic genes. They have used orthologous sequences from different organisms upstream of four different genes: preproinsulin, dihydrofolate reductase (DHFR), metallothioneins, and c-fos. They have also employed the upstream regions involved in purine metabolism from three Pyrococcus genomes.
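
To make the (ℓ, d) motif definition concrete, the following is a minimal brute-force sketch in Python. It is not the qPMS9 algorithm: it simply enumerates every candidate string of length ℓ over the alphabet and checks the definition directly, so its cost grows exponentially with ℓ and it is only practical for toy inputs. The function names, the `q_percent` parameter (the quorum, as a percentage of the input strings), and the default DNA alphabet are assumptions made for illustration.

```python
import math
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def occurs_within(candidate, text, d):
    """True if some substring of text with len(candidate) characters
    differs from candidate in at most d positions."""
    l = len(candidate)
    return any(
        hamming(candidate, text[i:i + l]) <= d
        for i in range(len(text) - l + 1)
    )


def brute_force_qpms(strings, l, d, q_percent=100, alphabet="ACGT"):
    """Return every length-l string that occurs, with at most d mismatches,
    in at least q_percent percent of the input strings.

    q_percent=100 corresponds to plain PMS (the motif must occur in every
    input string). This is an illustrative exhaustive search, not qPMS9.
    """
    needed = math.ceil(q_percent * len(strings) / 100)
    motifs = []
    for letters in product(alphabet, repeat=l):
        candidate = "".join(letters)
        hits = sum(occurs_within(candidate, s, d) for s in strings)
        if hits >= needed:
            motifs.append(candidate)
    return motifs


# Toy usage: three short DNA strings, exact matching (d = 0).
toy = ["ACGTAC", "TTCGTT", "GGCGTA"]
print(brute_force_qpms(toy, l=3, d=0))  # ['CGT']
```

The sketch enumerates |Σ|^ℓ candidates, which is already out of reach for instances such as (28, 12); exact algorithms in this family instead generate and prune candidates from the neighborhoods of substrings of the input.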

