Abstract

Motif finding is a difficult problem that has been studied for over 20 years. Some older popular motif finders are not suitable for analysis of the large data sets generated by next-generation sequencing. We recently published an efficient approximation (STEME) to the EM algorithm that is at the core of many motif finders such as MEME. This approximation allows the EM algorithm to be applied to large data sets. In this work we describe several efficient extensions to STEME that are based on the MEME algorithm. Together with the original STEME EM approximation, these extensions make STEME a fully-fledged motif finder with similar properties to MEME. We discuss the difficulty of objectively comparing motif finders. We show that STEME performs comparably to existing prominent discriminative motif finders, DREME and Trawler, on 13 sets of transcription factor binding data in mouse ES cells. We demonstrate the ability of STEME to find long degenerate motifs which these discriminative motif finders do not find. As part of our method, we extend an earlier method due to Nagarajan et al. for the efficient calculation of motif E-values. STEME's source code is available under an open source license and STEME is available via a web interface.

Highlights

  • Transcriptional regulation: spatio-temporal regulation of gene expression is critical for the correct function of many cellular processes

  • In the rest of this paper we discuss the difficulty of evaluating motif finders; we describe the results of evaluating STEME, DREME and Trawler on 13 data sets from mouse ES cells; we show that STEME is better at finding long degenerate motifs in these data sets; we discuss how DREME and Trawler are sensitive to the choice of control sequences and how STEME is robust to this choice

  • Difficulty of evaluating motif finders: as with many other tasks in computational biology, such as protein interaction prediction [23] and gene regulatory network inference [24], the lack of a gold standard makes the evaluation of motif finding algorithms difficult at best


Summary

Introduction

Transcriptional regulation

Spatio-temporal regulation of gene expression is critical for the correct function of many cellular processes. Proteins called transcription factors (TFs) bind to DNA and influence the rate of transcription of particular genes. TFs usually bind in a sequence-specific manner: they preferentially bind to particular transcription factor binding sites (TFBSs) in the genome. Several high-throughput experimental techniques have recently been developed to investigate the locations at which TFs bind. A typical experiment will report that a given TF binds to thousands of regions across the genome under a particular condition. These techniques cannot determine the exact location of the TFBSs: the regions they report can be several hundred base pairs long.
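To make the localisation problem concrete, the sketch below scores every window of a reported region against a position-specific probability model of a binding site and returns the best-scoring position. It is only an illustration of sequence-specific scoring, not the STEME or MEME algorithm: the motif values, the uniform background, the example region and the function names are all assumptions invented for this example, and only the forward strand is scanned.

```python
import math

# Hypothetical 4-column motif: per-position base probabilities.
# These values are illustrative only and do not come from the paper.
MOTIF = [
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
    {"A": 0.05, "C": 0.85, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
]
BACKGROUND = 0.25  # assumed uniform background probability for each base


def log_odds(site):
    """Log2 odds of `site` under the motif versus the background model."""
    return sum(math.log2(col[base] / BACKGROUND) for col, base in zip(MOTIF, site))


def best_site(region):
    """Return (start, site, score) for the highest-scoring window in `region`."""
    width = len(MOTIF)
    scored = [
        (log_odds(region[i:i + width]), i)
        for i in range(len(region) - width + 1)
    ]
    score, start = max(scored)
    return start, region[start:start + width], score


# A made-up region standing in for one reported by a binding experiment.
region = "TTGCAATTAGCTCCGGATTA"
start, site, score = best_site(region)
print(f"best site {site} at offset {start}, log-odds {score:.2f}")
```

In practice the motif itself is unknown, which is why a motif finder has to estimate the model and the site locations together from thousands of such regions.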

