Abstract

The discovery of functionally significant short, statistically overrepresented subsequence patterns (motifs) in a set of biological sequences is a challenging and important problem. Often, not all sequences in the set contain a motif. When traditional methods are used, these non-motif-containing sequences complicate the algorithmic discovery of motifs. MotifCatcher [1] is a framework developed by Seitzer et al. [2] that extends the sensitivity of existing motif-finding tools by applying random sampling to the input sequences. First, multiple subsets of sequences are randomly constructed. Some of these subsets contain a large number of motif-containing sequences. Traditional motif-finding tools are applied to all subsets in parallel, and significant motifs are recovered from the appropriate subsets. Finally, the framework returns candidate functionally significant motifs and organizes them into a tree, which allows further analysis. However, the current implementation of MotifCatcher does not scale to large sets of sequences, because it is not suited for distributed computing and keeps the whole sequence set in memory. To achieve better scalability, we present a redesigned implementation using the Kepler scientific workflow system that addresses those shortcomings. Kepler functions as a convenient front end for encoding the computation as a MapReduce pattern [3] to achieve high parallelism in the computationally intensive steps. Furthermore, Kepler greatly simplifies the substitution of the motif-finding algorithm used on each subset. Using Kepler, we were able to increase both the maximum size of the sequence sets in which motifs can be discovered and the number of subsets processed in parallel.
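The sampling-then-MapReduce idea described above can be illustrated with a minimal Python sketch. It is not the MotifCatcher or Kepler implementation: `toy_motif_finder` is a hypothetical stand-in for a real motif-finding tool (it simply reports the k-mer present in the most sequences of a subset), a local thread pool stands in for distributed execution, and a plain tally replaces the tree organization of candidate motifs. All function names, parameters, and example sequences are assumptions made for illustration.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def random_subsets(sequences, n_subsets, subset_size, seed=0):
    """Draw n_subsets random subsets, each sampled without replacement."""
    rng = random.Random(seed)
    return [rng.sample(sequences, subset_size) for _ in range(n_subsets)]


def toy_motif_finder(subset, k=6):
    """Hypothetical stand-in for a real motif finder.

    Returns (k-mer, support), where support is the number of sequences
    in the subset that contain the k-mer at least once.
    """
    support = Counter()
    for seq in subset:
        # A set ensures each sequence contributes at most once per k-mer.
        support.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    if not support:
        return None, 0
    # Highest support wins; ties are broken lexicographically.
    return min(support.items(), key=lambda kv: (-kv[1], kv[0]))


def map_reduce_motifs(sequences, n_subsets=200, subset_size=5,
                      k=6, min_support=4, seed=0):
    """Map: run the finder on every subset. Reduce: tally strong candidates."""
    subsets = random_subsets(sequences, n_subsets, subset_size, seed)
    with ThreadPoolExecutor() as pool:  # assumption: threads, not a cluster
        results = list(pool.map(lambda s: toy_motif_finder(s, k), subsets))
    tally = Counter()
    for kmer, count in results:
        if kmer is not None and count >= min_support:
            tally[kmer] += 1
    return tally


# Made-up example data: six sequences carry "TTGACA", four do not.
MOTIF_SEQS = [
    "AAAAATTGACACCCCC", "CCCCCTTGACAGGGGG", "GGGGGTTGACATTTTT",
    "ACACATTGACAGTGTG", "CGCGCTTGACATATAT", "GTGTGTTGACACACAC",
]
BACKGROUND_SEQS = [
    "AAAAAAAAAAAAAAAA", "CCCCCCCCCCCCCCCC",
    "GGGGGGGGGGGGGGGG", "ACGTACGTACGTACGT",
]
SEQS = MOTIF_SEQS + BACKGROUND_SEQS

tally = map_reduce_motifs(SEQS)
print(tally.most_common(3))
```

Only subsets that happen to be enriched in motif-containing sequences pass the `min_support` threshold, which mirrors the intuition that significant motifs are recovered from the appropriate subsets. Because the map step treats each subset independently, the finder can be swapped for another tool without changing the surrounding logic.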
