Abstract

Background: Automated protein function prediction methods are needed to keep pace with high-throughput sequencing. Many programs and databases exist for inferring different protein functions, and a pipeline that properly integrates these resources can benefit from the advantages of each method. However, integrated systems usually do not provide mechanisms to generate customized databases to predict particular protein functions. Here, we describe a tool termed PIPA (Pipeline for Protein Annotation) that has these capabilities.

Results: PIPA annotates protein functions by combining the results of multiple programs and databases, such as InterPro and the Conserved Domains Database, into common Gene Ontology (GO) terms. The major algorithms implemented in PIPA are: (1) a profile database generation algorithm, which generates customized profile databases to predict particular protein functions, (2) an automated ontology mapping generation algorithm, which maps various classification schemes into GO, and (3) a consensus algorithm to reconcile annotations from the integrated programs and databases.

PIPA's profile generation algorithm is employed to construct the enzyme profile database CatFam, which predicts catalytic functions described by Enzyme Commission (EC) numbers. Validation tests show that CatFam yields average recall and precision above 95.0%. CatFam is integrated with PIPA.

We use an association rule mining algorithm to automatically generate mappings between terms of two ontologies from annotated sample proteins. Incorporating the ontologies' hierarchical topology into the algorithm increases the number of generated mappings. In particular, it generates 40.0% more mappings from the Clusters of Orthologous Groups (COG) to EC numbers and yields a six-fold increase in mappings from COG to GO terms. The mappings to EC numbers show very high precision (99.8%) and recall (96.6%), while the mappings to GO terms show moderate precision (80.0%) and low recall (33.0%).

Our consensus algorithm for GO annotation is based on the computation and propagation of likelihood scores associated with GO terms. The test results suggest that, for a given recall, the application of the consensus algorithm yields higher precision than when consensus is not used.

Conclusion: The algorithms implemented in PIPA provide automated genome-wide protein function annotation based on reconciled predictions from multiple resources.
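
The association rule mining step described in the abstract can be illustrated with a minimal sketch. It is not PIPA's implementation: the support and confidence thresholds, the function names (`mine_mappings`, `ec_with_ancestors`), the sample COG identifiers, and the way the EC hierarchy is used (each EC number also counts as evidence for its less specific parents) are all assumptions made for illustration.

```python
from collections import Counter


def ec_with_ancestors(ec):
    """Return an EC number plus its less specific parents.

    Example: "1.1.1.1" also yields "1.1.1.-", "1.1.-.-" and "1.-.-.-".
    (Assumed way of using the hierarchy; the source does not give details.)
    """
    parts = ec.split(".")
    terms = {ec}
    for depth in range(1, len(parts)):
        terms.add(".".join(parts[:depth] + ["-"] * (len(parts) - depth)))
    return terms


def mine_mappings(proteins, expand=None, min_support=2, min_confidence=0.8):
    """Mine rules "source term -> target term" from annotated proteins.

    proteins: iterable of (source_terms, target_terms) pairs of sets.
    expand:   optional function mapping one target term to a set of terms
              (for example the term plus its ancestors).
    Returns a dict {(source, target): confidence}.
    """
    source_counts = Counter()
    pair_counts = Counter()
    for source_terms, target_terms in proteins:
        if expand is not None:
            # Union of each target term with its ancestors.
            target_terms = set().union(*(expand(t) for t in target_terms))
        for s in source_terms:
            source_counts[s] += 1
            for t in target_terms:
                pair_counts[(s, t)] += 1

    rules = {}
    for (s, t), n in pair_counts.items():
        confidence = n / source_counts[s]  # fraction of s-proteins that have t
        if n >= min_support and confidence >= min_confidence:
            rules[(s, t)] = confidence
    return rules


# Hypothetical toy sample: COG_A proteins disagree at the fourth EC digit.
sample = [
    ({"COG_A"}, {"1.1.1.1"}),
    ({"COG_A"}, {"1.1.1.1"}),
    ({"COG_A"}, {"1.1.1.2"}),
    ({"COG_B"}, {"2.7.1.1"}),
    ({"COG_B"}, {"2.7.1.1"}),
]

flat_rules = mine_mappings(sample)
hier_rules = mine_mappings(sample, expand=ec_with_ancestors)
print(len(flat_rules), len(hier_rules))
```

In this toy sample the flat run finds no confident rule for `COG_A`, because its proteins split between two specific EC numbers, while the hierarchy-aware run can still map `COG_A` to the shared parent `1.1.1.-`. This is one plausible reason why using the hierarchy increases the number of mappings.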

Highlights

  • Automated protein function prediction methods are needed to keep pace with high-throughput sequencing

  • The algorithms implemented in PIPA provide automated genome-wide protein function annotation based on reconciled predictions from multiple resources

  • The major integrated methods in PIPA consist of the CatFam database, constructed by the profile database generation program, the 11 publicly-available databases integrated by InterPro, the Conserved Domains Database (CDD), the database of Clusters of Orthologous Groups (COG), the transmembrane and signal peptide prediction program Phobius [26], and the bacterial subcellular localization prediction program PSORTb [27]
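
How predictions from several integrated resources might be reconciled into GO terms can be sketched as follows. The source states only that the consensus algorithm computes and propagates likelihood scores for GO terms. The combination rule (a noisy-OR over independent predictions), the propagation rule (a parent scores at least as high as any of its children), the function name `consensus_scores`, and the placeholder term identifiers are assumptions, not PIPA's actual method.

```python
def consensus_scores(predictions, parents):
    """Combine per-resource scores and propagate them up a GO-like DAG.

    predictions: list of (term, score) pairs, one per resource hit, with
                 score in [0, 1] read as the likelihood the hit is correct.
    parents:     dict mapping a term to a list of its parent terms.
    Returns a dict {term: consensus score}.
    """
    # Noisy-OR (assumed): the term is wrong only if every hit is wrong.
    miss = {}
    for term, score in predictions:
        miss[term] = miss.get(term, 1.0) * (1.0 - score)
    scores = {term: 1.0 - m for term, m in miss.items()}

    # Upward propagation (assumed): if a child term applies, so do its
    # ancestors, so each ancestor gets at least the child's score.
    stack = list(scores.items())
    while stack:
        term, score = stack.pop()
        for parent in parents.get(term, ()):
            if score > scores.get(parent, 0.0):
                scores[parent] = score
                stack.append((parent, score))
    return scores


# Hypothetical three-level hierarchy and hits from three resources.
toy_parents = {"GO:child": ["GO:parent"], "GO:parent": ["GO:root"]}
toy_hits = [("GO:child", 0.6), ("GO:child", 0.5), ("GO:parent", 0.7)]

toy_scores = consensus_scores(toy_hits, toy_parents)
for term in sorted(toy_scores):
    print(term, round(toy_scores[term], 3))
```

Thresholding the resulting scores would then trade recall against precision, which is the kind of comparison the abstract reports for runs with and without consensus.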

Introduction

Automated protein function prediction methods are needed to keep pace with high-throughput sequencing. Compared to direct sequence-based methods, such as function inference through BLAST search, inference based on function-related sequence features, such as domain profiles or motifs, is more accurate and more sensitive for proteins that have low sequence similarity with proteins of known function. This has led to the development and popularity of a wide variety of feature databases, such as Pfam [5], ProDom [6], PROSITE [7], the Clusters of Orthologous Groups (COG) [8], and the Conserved Domains Database (CDD) [9]. Feature databases built to predict particular protein functions have proven to be more accurate and sensitive than feature databases developed for general-purpose protein function prediction.
