A Monte Carlo-based framework enhances the discovery and interpretation of regulatory sequence motifs

Phillip Seitzer, Marc T Facciotti, Elizabeth G Wilbanks, David J Larsen

doi:10.1186/1471-2105-13-317

Open Access

Journal: BMC Bioinformatics
Publication Date: Nov 27, 2012
Citations: 62
License type: cc-by

Affiliation: University of California, Davis

Abstract

Background: Discovery of functionally significant short, statistically overrepresented subsequence patterns (motifs) in a set of sequences is a challenging problem in bioinformatics. Often, not all sequences in the set contain a motif, and these non-motif-containing sequences complicate the algorithmic discovery of motifs. Filtering the non-motif-containing sequences from the larger set while simultaneously determining the identity of the motif is therefore desirable, and it is a non-trivial problem in motif discovery research.

Results: We describe MotifCatcher, a framework that extends the sensitivity of existing motif-finding tools by employing random sampling to effectively remove non-motif-containing sequences from the motif search. We developed two implementations of our algorithm, each built around a commonly used motif-finding tool, and applied our algorithm to three diverse chromatin immunoprecipitation (ChIP) data sets. In each case, the motif finder with the MotifCatcher extension demonstrated improved sensitivity over the motif finder alone. Our approach organizes candidate functionally significant discovered motifs into a tree, which allowed us to gain additional insights. In all cases, we were able to support our findings with experimental work from the literature.

Conclusions: Our framework demonstrates that additional processing at the sequence entry level can significantly improve the performance of existing motif-finding tools. For each biological data set tested, we were able to propose novel biological hypotheses supported by experimental work from the literature. Specifically, in Escherichia coli, we suggested binding site motifs for 6 non-traditional LexA protein binding sites; in Saccharomyces cerevisiae, we hypothesize 2 disparate mechanisms for novel binding sites of the Cse4p protein; and in Halobacterium sp. NRC-1, we discovered subtle differences in a general transcription factor (GTF) binding site motif across several data sets. We suggest that small differences in our discovered motif could confer specificity for one or more homologous GTF proteins. We offer a free implementation of the MotifCatcher software package at http://www.bme.ucdavis.edu/facciotti/resources_data/software/.

Highlights

Discovery of functionally significant short, statistically overrepresented subsequence patterns in a set of sequences is a challenging problem in bioinformatics.
Theory and motivation: If an input set of sequence entries is corrupted with non-motif-containing entries, those entries do not belong in the data set from the standpoint of motif discovery; ideally, they would be removed prior to carrying out a motif search.
Tests on biological data (LexA binding in E. coli): The absence of clearly identifiable motifs in large subsets of chromatin immunoprecipitation (ChIP)-chip and ChIP-seq data is a common occurrence.

Summary

Introduction

Discovery of functionally significant short, statistically overrepresented subsequence patterns (motifs) in a set of sequences is a challenging problem in bioinformatics. Interest in motif discovery has inspired work on two related problems: (1) generating a set of candidate motifs from a data set of sequences, and (2) meaningfully condensing a set of motifs into a single non-redundant motif (Figure 1). The close relationship among motif discovery, candidate motif set creation, and motif set condensation has inspired the creation of integrated frameworks [21,26]: tools that first generate a candidate set of motifs (Figure 1B) and then parse this set to produce a single representative motif (Figure 1C).
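
The two-step idea of generating candidate motifs from random subsets and then condensing them can be illustrated with a minimal Python sketch. This is not the MotifCatcher implementation. The `best_kmer` function is a toy stand-in (an assumption made here) for a real motif-finding tool, the sequences and the planted pattern `CTGTAT` are invented for illustration, and the condensation step simply keeps the most frequently recovered candidate instead of building a motif tree. All function names and parameters are hypothetical.

```python
import random
from collections import Counter


def best_kmer(seqs, k):
    """Toy stand-in for a motif finder: the k-mer present in the most sequences."""
    counts = Counter()
    for s in seqs:
        # A set ensures each sequence votes at most once per k-mer.
        counts.update({s[i:i + k] for i in range(len(s) - k + 1)})
    # Highest count wins; ties are broken alphabetically for determinism.
    return min(counts, key=lambda kmer: (-counts[kmer], kmer))


def sample_candidates(seqs, k, subset_size, n_runs, seed=0):
    """Run the toy motif finder on random subsets and tally the candidates."""
    rng = random.Random(seed)
    candidates = Counter()
    for _ in range(n_runs):
        subset = rng.sample(seqs, subset_size)
        candidates[best_kmer(subset, k)] += 1
    return candidates


def condense(candidates, seqs):
    """Keep the most frequently recovered candidate and the sequences containing it."""
    motif = min(candidates, key=lambda m: (-candidates[m], m))
    members = [s for s in seqs if motif in s]
    return motif, members


# Invented data: six sequences carry the planted pattern CTGTAT, four do not.
sequences = [
    "AACTGTATGGCA", "TTGCTGTATCCA", "CTGTATAGGTTC",
    "GGACCTGTATTA", "CATGACCTGTAT", "TCTGTATCAGAG",
    "GGGCCCAAATTT", "ACACACGTGTGA", "TTAGGCATCGAT", "CGCGATATGCGC",
]

candidates = sample_candidates(sequences, k=6, subset_size=7, n_runs=50)
motif, members = condense(candidates, sequences)
print(motif, len(members))  # prints "CTGTAT 6"
```

In this toy data set every subset of seven sequences contains at least three carriers of the planted pattern, so each run recovers it. With real ChIP data the candidates would vary between subsets, which is why the tally across many runs, rather than any single run, is used to separate motif-containing sequences from the rest.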

Methods

Results

Conclusion