Abstract

Background

Comparative genomics methods such as phylogenetic profiling can mine powerful inferences from inherently noisy biological data sets. We introduce Sites Inferred by Metabolic Background Assertion Labeling (SIMBAL), a method that applies the Partial Phylogenetic Profiling (PPP) approach locally within a protein sequence to discover short sequence signatures associated with functional sites. The approach is based on the basic scoring mechanism employed by PPP, namely the use of binomial distribution statistics to optimize sequence similarity cutoffs during searches of partitioned training sets.

Results

Here we illustrate and validate the ability of the SIMBAL method to find functionally relevant short sequence signatures by application to two well-characterized protein families. In the first example, we partitioned a family of ABC permeases using a metabolic background property (urea utilization). Thus, the TRUE set for this family comprised members whose genome of origin encoded a urea utilization system. By moving a sliding window across the sequence of a permease, and searching each subsequence in turn against the full set of partitioned proteins, the method found which local sequence signatures best correlated with the urea utilization trait. Mapping of SIMBAL "hot spots" onto crystal structures of homologous permeases reveals that the significant sites are gating determinants on the cytosolic face rather than, say, docking sites for the substrate-binding protein on the extracellular face. In the second example, we partitioned a protein methyltransferase family using gene proximity as a criterion. In this case, the TRUE set comprised those methyltransferases encoded near the gene for the substrate RF-1. SIMBAL identifies sequence regions that map onto the substrate-binding interface while ignoring regions involved in the methyltransferase reaction mechanism in general. Neither method for training set construction requires any prior experimental characterization.

Conclusions

SIMBAL shows that, in functionally divergent protein families, selected short sequences often significantly outperform their full-length parent sequence for making functional predictions by sequence similarity, suggesting avenues for improved functional classifiers. When combined with structural data, SIMBAL affords the ability to localize and model functional sites.
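
The scoring idea named in the Background (binomial statistics used to choose a similarity cutoff) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `binomial_tail` and `best_cutoff`, the boolean encoding of the ranked hit list, and the choice of the upper-tail probability as the score are all assumptions made for illustration.

```python
import math


def binomial_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(
        math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1)
    )


def best_cutoff(ranked_labels, p_background):
    """Scan every cutoff depth in a similarity-ranked hit list.

    ranked_labels: booleans, best hit first; True means the hit comes
        from a genome in the TRUE partition (hypothetical encoding).
    p_background: fraction of all searched proteins that lie in the
        TRUE partition, i.e. the chance of a TRUE hit at random.

    Returns (depth, true_hits, tail_probability) for the depth whose
    count of TRUE hits is least likely under the background rate.
    """
    best = None
    true_so_far = 0
    for depth, is_true in enumerate(ranked_labels, start=1):
        true_so_far += is_true
        prob = binomial_tail(true_so_far, depth, p_background)
        if best is None or prob < best[2]:
            best = (depth, true_so_far, prob)
    return best


# Toy ranked hit list: the top four hits are all from TRUE genomes.
ranked = [True, True, True, True, False, True, False, False]
depth, true_hits, prob = best_cutoff(ranked, p_background=0.25)
print(depth, true_hits, prob)
```

In this toy list the most significant depth is the fourth hit, where four of four hits fall in the TRUE partition. A lower tail probability at the best depth indicates a query whose top hits are more strongly enriched for the TRUE partition.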

Highlights

  • Comparative genomics methods such as phylogenetic profiling can mine powerful inferences from inherently noisy biological data sets

  • We show that Sites Inferred by Metabolic Background Assertion Labeling (SIMBAL) can mine a protein sequence for short sequence regions, presumably containing critical sites, and that it outperforms other simple classifiers, such as BLAST matches to full-length proteins, for the task of classifying functionally diverged members of homology families

  • In Partial Phylogenetic Profiling, the implicit “training set” is all proteins from all genomes in the TRUE partition of the profile. This training set is noisy: usually fewer than one protein in 1000 matches the reference profile in a meaningful way. Yet the power of profiling methods is beyond dispute
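
The window scan behind the second highlight (mining a protein sequence for short regions) can be sketched as below. This is only an illustration: `sliding_windows`, the window width, the step size, and the example sequence are hypothetical choices, and the similarity search and scoring that would be run on each subsequence are not shown.

```python
def sliding_windows(sequence, width, step=1):
    """Yield (start, subsequence) for every full-width window.

    start is a 0-based offset into the sequence. Windows that would run
    past the end of the sequence are not emitted.
    """
    if width <= 0 or step <= 0:
        raise ValueError("width and step must be positive")
    for start in range(0, len(sequence) - width + 1, step):
        yield start, sequence[start:start + width]


# Made-up peptide, used only to show the window boundaries.
for start, window in sliding_windows("MKTAYIAK", width=4, step=2):
    print(start, window)
```

Each emitted subsequence would then serve as its own query against the partitioned protein set, so that every window position receives a separate score.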

Introduction

Comparative genomics methods such as phylogenetic profiling can mine powerful inferences from inherently noisy biological data sets. Profile methods are being used increasingly to relate protein families to varied types of second traits such as phenotype, biological niche, transcriptional regulatory sites, and so on [2]. One such type of trait, metabolic capability, can be calculated by the Genome Properties system [3] using rules based largely on hidden Markov models (HMMs) from the TIGRFAMs collection [4], as well as by the application of other methodologies such as Subsystems [5] or MetaCyc [6]. We have found that these assertions of metabolic background (profiles) provide excellent opportunities for launching phylogenetic profiling studies.
