Abstract

Background: The goal of most programs developed to find transcription factor binding sites (TFBSs) is the identification of discrete sequence motifs that are significantly over-represented in a given set of sequences where a transcription factor (TF) is expected to bind. These programs assume that the nucleotide conservation of a specific motif is indicative of a selective pressure required for the recognition of a TF for its corresponding TFBS. Despite their extensive use, the accuracies reached with these programs remain low. In many cases, true TFBSs are excluded from the identification process, especially when they correspond to low-affinity but important binding sites of regulatory systems.

Results: We developed a computational protocol based on molecular and structural criteria to perform biologically meaningful and accurate phylogenetic footprinting analyses. Our protocol considers fundamental aspects of the TF-DNA binding process, such as: i) the active homodimeric conformations of TFs that impose symmetric structures on the TFBSs, ii) the cooperative binding of TFs, iii) the effects of the presence or absence of co-inducers, iv) the proximity between two TFBSs or one TFBS and a promoter that leads to very long spurious motifs, v) the presence of AT-rich sequences not recognized by the TF but that are required for DNA flexibility, and vi) the dynamic order in which the different binding events take place to determine a regulatory response (i.e., activation or repression). In our protocol, the abovementioned criteria were used to analyze a profile of consensus motifs generated from canonical Phylogenetic Footprinting Analyses using a set of analysis windows of incremental sizes. To evaluate the performance of our protocol, we analyzed six members of the LysR-type TF family in Gammaproteobacteria.

Conclusions: The identification of TFBSs based exclusively on the significance of the over-representation of motifs in a set of sequences might lead to inaccurate results. The consideration of different molecular and structural properties of the regulatory systems benefits the identification of TFBSs and enables the development of elaborate, biologically meaningful and precise regulatory models that offer a more integrated view of the dynamics of the regulatory process of transcription.

Electronic supplementary material: The online version of this article (doi:10.1186/s12864-016-3025-3) contains supplementary material, which is available to authorized users.

Highlights

  • The goal of most programs developed to find transcription factor binding sites (TFBSs) is the identification of discrete sequence motifs that are significantly over-represented in a given set of sequences where a transcription factor (TF) is expected to bind

  • In many cases, true TFBSs are excluded from the identification process or are imprecisely identified, especially when they correspond to low-affinity but important binding sites of the regulatory systems

  • To assess the performance of our protocol, we performed in silico identifications of the binding sites of TFs of six regulatory systems that are members of the LysR-type family in Gammaproteobacteria, with target genes (TGs) commonly transcribed in divergent orientations


Introduction

The goal of most programs developed to find transcription factor binding sites (TFBSs) is the identification of discrete sequence motifs that are significantly over-represented in a given set of sequences where a transcription factor (TF) is expected to bind. The in silico identification of TFBSs is a key issue for many molecular biology studies aimed at characterizing regulatory elements in genome sequences. These analyses have been performed by considering either different co-regulated genes in one genome [4] or a set of upstream regions of orthologous genes in closely related genomes, a procedure known as phylogenetic footprinting analysis [5,6,7,8]. To evaluate the significance of these TFBS predictions, different approaches have been developed based on theoretical models, such as log-odds and entropy-weighted values [14], or on the combination of theoretical and empirical score distributions [15]. Despite their extensive use, the accuracies reached with these programs remain low. The significance of a motif given its over-representation in a set of sequences of co-regulated genes is not necessarily the best way to identify the set of TFBSs for a given regulon.
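To make the log-odds scoring mentioned above concrete, the following is a minimal sketch of how a position-specific log-odds matrix can be built from aligned sites and used to scan a sequence. It is an illustration only, not the protocol or any of the cited programs: the function names, the uniform background, the pseudocount of 1, and the toy sites are all assumptions introduced here.

```python
import math

ALPHABET = "ACGT"


def build_log_odds_matrix(sites, background=None, pseudocount=1.0):
    """Build a log-odds matrix from aligned, equal-length sites.

    Assumptions (not from the source): uniform background frequencies
    unless given, and a pseudocount added to every base count.
    """
    if background is None:
        background = {base: 0.25 for base in ALPHABET}
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")

    matrix = []
    for pos in range(length):
        counts = {base: pseudocount for base in ALPHABET}
        for site in sites:
            # Only A, C, G, T are handled; other symbols raise KeyError.
            counts[site[pos].upper()] += 1
        total = sum(counts.values())
        matrix.append({
            base: math.log2((counts[base] / total) / background[base])
            for base in ALPHABET
        })
    return matrix


def score_window(matrix, window):
    """Sum the per-position log-odds values for one window."""
    return sum(column[base] for column, base in zip(matrix, window.upper()))


def best_hit(matrix, sequence):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(matrix)
    scored = [
        (score_window(matrix, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(scored)


if __name__ == "__main__":
    # Hypothetical aligned sites, invented for illustration.
    toy_sites = ["TTGACA", "TTGACA", "TTGACT", "TTGATA"]
    matrix = build_log_odds_matrix(toy_sites)
    score, start = best_hit(matrix, "GGCCTTGACAGGCC")
    print(f"best window starts at {start} with score {score:.2f}")
```

A scheme of this kind ranks windows purely by their similarity to the over-represented motif, which is why a low-affinity but functional site can fall below any score threshold chosen for it.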
