Improved Motif Detection in Large Sequence Sets with Random Sampling in a Kepler workflow

Sven Köhler, Phillip Seitzer, Marc T Facciotti, Bertram Ludäscher

doi:10.1016/j.procs.2012.04.219

Open Access

Journal: Procedia Computer Science	Publication Date: Jan 1, 2012
Citations: 1	License type: cc-by-nc-nd

Affiliation: University of California, Davis

Abstract

The discovery of functionally significant short, statistically overrepresented subsequence patterns (motifs) in a set of biological sequences is a challenging and important problem. Often, not all sequences in the set contain a motif, and with traditional methods these non-motif-containing sequences complicate the algorithmic discovery of motifs. MotifCatcher [1] is a framework developed by Seitzer et al. [2] that extends the sensitivity of existing motif-finding tools by applying random sampling to the input sequences. First, multiple subsets of sequences are randomly constructed; some of these subsets contain a large number of motif-containing sequences. Traditional motif-finding tools are then applied to all subsets in parallel, and significant motifs are recovered from the appropriate subsets. Finally, the framework returns candidate functionally significant motifs and organizes them into a tree, which allows further analysis. However, the current implementation of MotifCatcher does not scale to large sets of sequences, because it is not suited for distributed computing and keeps the whole sequence set in memory. To achieve better scalability, we present a redesigned implementation using the Kepler scientific workflow system that addresses those shortcomings. Kepler functions as a convenient front end to encode the computation into a MapReduce pattern [3] to achieve high parallelism for computationally intensive steps. Furthermore, Kepler greatly simplifies the substitution of the motif-finding algorithm used on each subset. Using Kepler, we were able to increase both the maximum size of the sequence sets in which motifs can be discovered and the number of subsets processed in parallel.
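
The sampling scheme described in the abstract can be illustrated with a minimal Python sketch. It is not the MotifCatcher or Kepler implementation: the function names, parameters, and the `toy_motif_finder` stand-in (which simply reports the k-mer shared by the most sequences in a subset) are assumptions made for illustration. The map step runs the finder on every random subset in parallel, and the reduce step tallies motifs that are fully supported within their subset. The tree organization of candidate motifs is omitted.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def sample_subsets(sequences, n_subsets, subset_size, seed=0):
    """Draw n_subsets random subsets, each without repeated sequences."""
    rng = random.Random(seed)
    return [rng.sample(sequences, subset_size) for _ in range(n_subsets)]


def toy_motif_finder(subset, k=4):
    """Hypothetical stand-in for a real motif-finding tool.

    Returns the k-mer present in the most sequences of the subset,
    together with the fraction of sequences that contain it.
    """
    presence = Counter()
    for seq in subset:
        # Count each k-mer at most once per sequence.
        presence.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    kmer, count = max(presence.items(), key=lambda kv: (kv[1], kv[0]))
    return kmer, count / len(subset)


def find_candidate_motifs(sequences, n_subsets=50, subset_size=2,
                          min_support=1.0, seed=0):
    """Map the finder over random subsets, then reduce by tallying motifs."""
    subsets = sample_subsets(sequences, n_subsets, subset_size, seed)
    # Map step: subsets are independent, so they can be processed in parallel.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(toy_motif_finder, subsets))
    # Reduce step: keep motifs found in enough sequences of their subset.
    votes = Counter(kmer for kmer, support in results if support >= min_support)
    return votes.most_common()


if __name__ == "__main__":
    seqs = [
        "TTTTACGTTTTT",  # contains ACGT
        "GGGGACGTGGGG",  # contains ACGT
        "CCCCACGTCCCC",  # contains ACGT
        "AAAAACGTAAAA",  # contains ACGT
        "TGTGTGTGTGTG",  # no motif
    ]
    print(find_candidate_motifs(seqs))
```

Because each subset is handled independently, the finder passed to the map step could be swapped for another tool without touching the sampling or reduce logic, which mirrors the substitution flexibility the abstract attributes to the workflow design.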

Full Text