Abstract

Background: Identification of transcription factor binding sites (also called ‘motif discovery’) in DNA sequences is a basic step in understanding genetic regulation. Although many successful programs have been developed, the problem is far from being solved on account of diversity in gene expression/regulation and the low specificity of binding sites. State-of-the-art algorithms have their own constraints (e.g., high time or space complexity for finding long motifs, low precision in identification of weak motifs, or the OOPS constraint: one occurrence of the motif instance per sequence) which limit their scope of application.

Results: In this paper, we present a novel and fast algorithm we call TFBSGroup. It is based on community detection from a graph and is used to discover long and weak (l, d) motifs under the ZOMOPS constraint (zero, one or multiple occurrence(s) of the motif instance(s) per sequence), where l is the length of a motif and d is the maximum number of mutations between a motif instance and the motif itself. Firstly, TFBSGroup transforms the (l, d) motif search in sequences to focus on the discovery of dense subgraphs within a graph. It identifies these subgraphs using a fast community detection method for obtaining coarse-grained candidate motifs. Next, it greedily refines these candidate motifs towards the true motif within their own communities. Empirical studies on synthetic (l, d) samples have shown that TFBSGroup is very efficient (e.g., it can find true (18, 6), (24, 8) motifs within 30 seconds). More importantly, the algorithm has succeeded in rapidly identifying motifs in a large data set of prokaryotic promoters generated from the Escherichia coli database RegulonDB. The algorithm has also accurately identified motifs in ChIP-seq data sets for 12 mouse transcription factors involved in ES cell pluripotency and self-renewal.

Conclusions: Our novel heuristic algorithm, TFBSGroup, is able to quickly identify nearly exact matches for long and weak (l, d) motifs in DNA sequences under the ZOMOPS constraint. It is also capable of finding motifs in real applications. The source code for TFBSGroup can be obtained from http://bioinformatics.bioengr.uic.edu/TFBSGroup/.
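
The (l, d) definition and the graph view described in the abstract can be illustrated with a minimal Python sketch. It is not the TFBSGroup implementation: the function names (`hamming`, `is_instance`, `build_graph`), the choice of one node per l-mer window, and the edge threshold of 2d are assumptions made for illustration. The 2d threshold follows from the triangle inequality (two instances that each differ from the same motif in at most d positions differ from each other in at most 2d positions); the abstract does not state which threshold TFBSGroup itself uses. The community detection and greedy refinement steps are not shown.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def is_instance(lmer, motif, d):
    """True if lmer is an (l, d) instance of motif: same length, at most d mutations."""
    return len(lmer) == len(motif) and hamming(lmer, motif) <= d


def build_graph(seqs, l, d):
    """Build an illustrative l-mer graph (hypothetical construction).

    Each node is (sequence index, offset, l-mer). Two nodes are joined
    when their l-mers differ in at most 2*d positions, which is a
    necessary condition for both to be instances of one (l, d) motif.
    """
    nodes = [
        (i, j, s[j:j + l])
        for i, s in enumerate(seqs)
        for j in range(len(s) - l + 1)
    ]
    edges = {n: set() for n in range(len(nodes))}
    for a, b in combinations(range(len(nodes)), 2):
        if hamming(nodes[a][2], nodes[b][2]) <= 2 * d:
            edges[a].add(b)
            edges[b].add(a)
    return nodes, edges


if __name__ == "__main__":
    # Toy input: an (8, 1) motif "ACGTACGT" planted in the first two sequences.
    seqs = ["TTACGTACGTTT", "GGACGAACGTGG", "CCCCCCCCCCCC"]
    nodes, edges = build_graph(seqs, l=8, d=1)
    for n, (i, j, lmer) in enumerate(nodes):
        if edges[n]:
            print(i, j, lmer, "degree", len(edges[n]))
```

In a graph built this way, true motif instances tend to be mutually connected, which is why searching for dense subgraphs is a plausible proxy for the motif search. The all-pairs comparison here is quadratic in the number of l-mers and is only suitable for toy inputs.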

Highlights

  • Identification of transcription factor binding sites in DNA sequences is a basic step in understanding genetic regulation

  • We first tested TFBSGroup on a series of synthetic (l, d) samples and compared it with iTriplet and RecMotif. iTriplet and RecMotif are both sample-driven algorithms which heuristically extract q-cliques from an N-partite graph (q = N for RecMotif because of the OOPS constraint)

  • We used TFBSGroup on a large data set of prokaryotic promoters generated from the Escherichia coli database RegulonDB for the purpose of finding real long and weak motifs

Summary

Results

If no exact match or similar result was found in the literature, we listed the top-ranked motif consensuses with the most binding locations in the middle regions of the sequences. The LexA data includes 10 DNA sequences of length 222 with the consensus CTGTnnnnnnnnnnCAG (consensus model: (16, 6)) and 10 actual binding sites. [...] is close to or greater than 50% on these three data sets, where TP is the number of true positive sites and FP is the number of false positive sites. It should be pointed out that some results marked with an ‘*’ in Figures 1 and 2 may not be satisfactory due to the low specificity of binding sites for the TFs, an insufficient number of sequences from which to draw a statistical conclusion, or a lack of knowledge of the proper (l, d) models. The TFBS sequences of this alternative motif were complementary to those of the main motif in Figure 3 for each of three TFs.

