Abstract
Background: To explain the vastly different phenotypes exhibited by the same organism under different conditions, it is essential that we understand how the organism's genes are coordinately regulated. While there are many excellent tools for predicting sequences encoding proteins or RNA genes, few algorithms exist to predict regulatory sequences on a genome-wide scale with no prior information.

Results: To identify motifs involved in the control of transcription, an algorithm was developed that searches upstream of operons for improbably frequent dimers. The algorithm was applied to the B. subtilis genome, which is predicted to encode approximately 200 DNA-binding proteins. The over-represented dimers could be clustered into 317 distinct groups, each thought to represent a class of motifs uniquely recognized by some transcription factor. For each cluster of dimers, a representative weight matrix was derived and scored over the regions upstream of the operons to predict the sites recognized by the cluster's factor, and a putative regulon of the operons immediately downstream of the sites was inferred. The distribution of the number of operons per predicted regulon is comparable to that for well-characterized transcription factors. The most highly over-represented dimers matched σA, the T-box, and σW sites. We have evidence to suggest that at least 52 of our clusters of dimers represent actual regulatory motifs, based on the groups' weight matrix matches to experimentally characterized sites, the functional similarity of the component operons of the groups' regulons, and the positional biases of the weight matrix matches. All predictions are assigned a significance value, and thresholds are set to avoid false positives. Where possible, we examine our false negatives, drawing examples from known regulatory motifs and regulons inferred from RNA expression data.

Conclusions: We have demonstrated that in the case of B. subtilis our algorithm allows for the genome-wide identification of regulatory sites. As well as recovering known sites, we predict new sites of as yet uncharacterized factors. Results can be viewed at .
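As a rough illustration of the dimer search described in the abstract, the following Python sketch counts how many upstream sequences contain each gapped dimer (two short words separated by a fixed number of arbitrary bases). The function name `count_gapped_dimers`, the fixed word length, and the gap range are assumptions made for illustration only. The significance test used to decide that a dimer is improbably frequent is not described here, so it is deliberately left out.

```python
from collections import Counter


def count_gapped_dimers(upstream_seqs, word_len=4, gaps=range(16, 22)):
    """Count, for each gapped dimer, the number of sequences containing it.

    A dimer is written in the form leftN<gap>right, e.g. 'ttgaN20ataa'.
    word_len and gaps are illustrative choices, not values from the paper.
    """
    counts = Counter()
    for seq in upstream_seqs:
        seq = seq.lower()
        seen = set()  # each dimer is counted at most once per sequence
        for gap in gaps:
            span = 2 * word_len + gap
            for i in range(len(seq) - span + 1):
                left = seq[i:i + word_len]
                right = seq[i + word_len + gap:i + span]
                seen.add(f"{left}N{gap}{right}")
        counts.update(seen)
    return counts


# Toy upstream regions (made up for the example, not real B. subtilis data).
toy_upstream = [
    "cc" + "ttga" + "c" * 20 + "ataat" + "gg",
    "ttga" + "g" * 20 + "ataat" + "cc",
    "c" * 40,
]
toy_counts = count_gapped_dimers(toy_upstream)
print(toy_counts["ttgaN20ataa"])
```

Counting each dimer once per sequence, rather than once per occurrence, keeps a single repetitive upstream region from dominating the tally. A real implementation would then compare these counts against a background expectation.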
Highlights
To explain the vastly different phenotypes exhibited by the same organism under different conditions, it is essential that we understand how the organism's genes are coordinately regulated
Bacterial genome annotation has generally been confined to the prediction of sequences encoding proteins and prominent families of RNA genes
Even for E. coli, fewer than 20% of the operons have been thoroughly examined upstream for regulatory motifs, and fewer than a quarter of the 300 or more putative DNA-binding proteins have known sites. Automatic methods for inferring regulatory motifs must therefore approach those used for inferring protein-coding sequences and function if the full potential of the 'genomic revolution' is to be realized
Summary
Nomenclature: To simplify terminology, we will use the term 'operon' in what follows to denote our putative operons predicted as described in "Putative operons and upstream sequences." Since a particular weight matrix is thought to represent sites uniquely recognized by some transcription factor, the term 'regulon' will be used for the group of operons having a match to the matrix directly upstream, i.e. the direct targets of the factor.

In the list of our 10 most significant dimers, the four dimers ttgaN20ataat, ttgaN19tata, ttgaN21taat, and ttgacN19ataat all correspond to the consensus sequence TTGACAN17TATAAT recognized by the primary sigma factor σA [20]. The two dimers ggtggN3cgcg and agggtN4ccgcg correspond to the T-box [27], whose known consensus sequence AANNAGGGTGGTACCGCGNN is involved in the alternate transcription termination regulation of the aminoacyl-tRNA synthetases. The dimer gaaacN16cgta represents the consensus sequence TGAAACN16CGTA recognized by the antimicrobial resistance sigma factor σW [20].

Of the 52 matrices, 10 represent experimentally characterized regulatory factors, 30 have regulons that contain a disproportionate number of operons with related functions, and 32 have matches exhibiting some positional bias. The matrices are sub-divided into categories according to the means by which they were identified: by comparison to documented regulatory mechanisms, by inspecting the operons in a matrix's regulon for related functions, and by examining the matrix's matches for positional biases.
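The correspondence between the dimers and the consensus sequences quoted above can be checked mechanically. The sketch below is only an illustration: the helper names `expand` and `offsets_in_consensus` are hypothetical, and the matching rule (gap positions in the dimer match anything, dimer letters must equal the consensus letter exactly) is an assumption, not the paper's clustering procedure.

```python
import re


def expand(pattern):
    """Expand run-length notation, e.g. 'ttgaN20ataat' -> 'TTGA' + 'N'*20 + 'ATAAT'."""
    return re.sub(r"N(\d+)", lambda m: "N" * int(m.group(1)), pattern.upper())


def offsets_in_consensus(dimer, consensus):
    """Return the offsets at which the dimer is compatible with the consensus.

    Assumed rule: 'N' in the dimer is a wildcard; every other dimer letter
    must be identical to the consensus letter at that position.
    """
    d, c = expand(dimer), expand(consensus)
    hits = []
    for i in range(len(c) - len(d) + 1):
        window = c[i:i + len(d)]
        if all(x == "N" or x == y for x, y in zip(d, window)):
            hits.append(i)
    return hits


# Dimer-to-consensus pairs as quoted in the text.
pairs = {
    "TTGACAN17TATAAT": ["ttgaN20ataat", "ttgaN19tata", "ttgaN21taat", "ttgacN19ataat"],
    "AANNAGGGTGGTACCGCGNN": ["ggtggN3cgcg", "agggtN4ccgcg"],
    "TGAAACN16CGTA": ["gaaacN16cgta"],
}
for consensus, dimers in pairs.items():
    for dimer in dimers:
        print(consensus, dimer, offsets_in_consensus(dimer, consensus))
```

Under this rule, each of the seven dimers aligns with its consensus at exactly one offset, which shows how several distinct dimers can capture the same underlying motif.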