Abstract

Phylogenomic inference of protein (or gene) function attempts to address the question, “What function does this protein perform?” in an evolutionary context. As originally outlined by Jonathan Eisen [1–3], phylogenomic inference of protein function is a multistep process involving selection of homologs, multiple sequence alignment (MSA), and phylogenetic tree construction; overlaying annotations on the tree topology; discriminating between orthologs and paralogs; and, finally, inferring the function of a protein based on the orthologs identified by this process and the annotations retrieved. Figure 1 shows an example of using annotated subfamily groupings to infer function, in a manner similar to [1]. One of us, while at Celera Genomics, separately came up with a similar approach for the functional classification of the human genome [4], based on the automated identification of functional subfamilies using the SCI-PHY algorithm and the use of subfamily hidden Markov models (HMMs) to classify novel sequences [5,6]. Our experiences over the past several years in developing computational pipelines for automating phylogenomic inference at the genome scale [7], and the challenges we have faced in this effort, motivate this paper.
Figure 1. Phylogenomic Analysis of Protein Function Using Subfamily Annotation
In practice, phylogenomic inference of gene function is not often used. Far from it. The majority of novel sequences are assigned a putative function through annotation transfer from the top hits in a database search. In our analysis of over 300,000 proteins in the UniProt database, only 3% of proteins with informative annotations (i.e., those not labelled as “hypothetical” or “unknown”) had experimental support for their annotations; 97% were annotated using electronic evidence alone. These annotations are uploaded to GenBank, where they persist even if they are eventually determined to be in error.
The systematic errors associated with this annotation protocol have been pointed out by numerous investigators over the years [8–10]. The root causes of these errors are:

  • Gene duplication. This enables protein superfamilies to innovate novel functions on the same structural template, so that the top database hit may have a function distinct from the query.

  • Domain shuffling. Domain fusion and fission events add an additional layer of complexity, as a query and database hit may share only a local region of homology and thus have entirely different molecular functions and structures.

  • Propagation of existing errors in database annotations. This is particularly pernicious, as existing annotation errors are seldom detected and, even if detected, are not necessarily corrected.

  • Evolutionary distance. Two proteins can share a common ancestor and domain structure, yet have very different functions simply because they are found in very distantly related species.

Phylogenomic analysis, properly applied, avoids these errors and provides a mechanism for detecting existing database annotation errors [3,7]. Why then is phylogenomic inference not used more widely? We believe this is due to four reasons. First, the actual frequency of annotation error is not known, so the gravity of the situation is not recognized. Second, phylogenomic inference is a much more complicated endeavor than a simple database search and requires significantly more expertise and computing resources. It is therefore not easily applied at the genome scale. Third, millions of dollars and years of effort have been poured into developing computational annotation systems that depend on annotation transfer from top database hits, perhaps overlaid with domain prediction methods such as PFAM or the NCBI CDD [11,12]. Fourth, phylogenomic approaches to protein function prediction have arisen only in the last few years, while database search methods have been available for much longer.
Revolutions do not normally take place overnight. These four reasons result in phylogenomic inference being applied on a one-off basis, for a few protein superfamilies here and there. This may be about to change. A variety of software tools and algorithms enabling phylogenomic inference have been developed in recent years (see Table 1). Some of these methods have based annotation transfer on the identification of orthologs [13–15] or of functional subfamilies [6,16–21]. Other groups have used whole-tree analyses [22–24]. Still other groups employ expert knowledge to define functional subtypes and then develop statistical models to allow users to classify novel sequences [25,26]; these expert system-based approaches are unfortunately limited by the scarcity of experimental data for most protein families.
Table 1. Resources for Phylogenomic Analysis
It is worth examining the assumptions underlying these phylogenomic resources, and phylogenomic inference as a whole.
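
The contrast drawn above between top-hit annotation transfer and ortholog-based inference can be illustrated with a toy sketch. Everything in it is hypothetical and invented for illustration: the sequence names, the similarity scores, the labels `function_1` and `function_2`, and the assumption that each internal tree node is already labelled as a speciation or duplication event. It is not the SCI-PHY algorithm or any of the pipelines cited in this paper; it only shows how a paralog that happens to be the top database hit can mislead annotation transfer, while restricting transfer to orthologs on the tree does not.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    name: str
    event: Optional[str] = None      # "speciation" or "duplication" (internal nodes)
    function: Optional[str] = None   # known annotation (leaves), None if unannotated
    children: list["Node"] = field(default_factory=list)


def leaves(node):
    """Return all leaf nodes below (or equal to) the given node."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


def path_to(node, target_name):
    """Return the list of nodes from `node` down to the named node, or None."""
    if node.name == target_name:
        return [node]
    for child in node.children:
        sub = path_to(child, target_name)
        if sub is not None:
            return [node] + sub
    return None


def orthologs(root, query_name):
    """Leaves whose most recent common ancestor with the query is a speciation node."""
    path = path_to(root, query_name)
    found = []
    # Walk from the query's parent up to the root.
    for depth in range(len(path) - 2, -1, -1):
        ancestor, toward_query = path[depth], path[depth + 1]
        if ancestor.event != "speciation":
            continue  # sequences separated from the query by a duplication are paralogs
        for child in ancestor.children:
            if child is not toward_query:
                found.extend(leaves(child))
    return found


def top_hit_function(scores, annotations):
    """Transfer the annotation of the most similar annotated sequence."""
    annotated = {name: s for name, s in scores.items() if annotations.get(name)}
    best = max(annotated, key=annotated.get)
    return annotations[best]


def ortholog_functions(root, query_name):
    """Collect the annotations found on orthologs of the query."""
    return {leaf.function for leaf in orthologs(root, query_name) if leaf.function}


# Hypothetical gene tree: an ancient duplication produced subfamilies A and B.
tree = Node("root", "duplication", children=[
    Node("A", "speciation", children=[
        Node("query"),
        Node("human_A", function="function_1"),
    ]),
    Node("B", "speciation", children=[
        Node("mouse_B", function="function_2"),
        Node("human_B", function="function_2"),
    ]),
])

# Hypothetical similarity scores of each database sequence to the query.
scores = {"human_A": 0.58, "mouse_B": 0.64, "human_B": 0.61}
annotations = {leaf.name: leaf.function for leaf in leaves(tree)}

print("top-hit transfer:", top_hit_function(scores, annotations))
print("ortholog-based:  ", ortholog_functions(tree, "query"))
```

In this toy case the highest-scoring hit is a paralog from subfamily B, so top-hit transfer reports the subfamily B annotation, whereas the tree-based approach only considers `human_A`, the one sequence related to the query through a speciation event. The sketch takes the tree and its event labels as given; as the following sections discuss, the reliability of that tree is itself an assumption.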

Highlights

  • Phylogenomic inference of protein function attempts to address the question, “What function does this protein perform?” in an evolutionary context

  • A variety of software tools and algorithms enabling phylogenomic inference have been developed in recent years

  • Phylogenomic inference is based on a fundamental assumption: the phylogenetic tree topology used as the basis of functional inference is correct


Summary

Functional Classification Using Phylogenomic Inference

A variety of software tools and algorithms enabling phylogenomic inference have been developed in recent years (see Table 1). Some of these methods have based annotation transfer on the identification of orthologs [13,14,15] or of functional subfamilies [6,16,17,18,19,20,21]. Still other groups employ expert knowledge to define functional subtypes and develop statistical models to allow users to classify novel sequences [25,26]; these expert system-based approaches are limited by the scarcity of experimental data for most protein families. It is worth examining the assumptions underlying these phylogenomic resources, and phylogenomic inference as a whole.

Duncan Brown and Kimmen Sjolander are at the University of California Berkeley, Berkeley, California, United States of America.

Tree Topology Accuracy
The Reliability and Source of Existing Database Annotations
Functional Inference Based on Assumed Orthology
Findings
The Future of Phylogenomic Inference