Abstract

Background: Predicting protein function from primary sequence is an important open problem in modern biology. Not only are there many thousands of proteins of unknown function, but current approaches for predicting function also need to be improved. One problem in particular is overly specific function predictions, which we address here with a new statistical model of the relationship between protein sequence similarity and protein function similarity.

Methodology: Our statistical model is based on sets of proteins with experimentally validated functions and on numeric measures of function specificity and function similarity derived from the Gene Ontology. The model predicts the similarity of function between two proteins given their amino acid sequence similarity, as measured by statistics from the BLAST sequence alignment algorithm. A novel aspect of our model is that it predicts the degree of function similarity shared between two proteins over a continuous range of sequence similarity, facilitating prediction of function with an appropriate level of specificity.

Significance: Our model shows nearly exact function similarity for proteins with high sequence similarity (bit score >244.7, e-value <1e−62, non-redundant NCBI protein database (NRDB)) and only a small likelihood of a specific function match for proteins with low sequence similarity (bit score <54.6, e-value >1e−05, NRDB). For sequence similarity ranges in between, our annotation model shows an increasing relationship between function similarity and sequence similarity, but with considerable variability. We applied the model to a large set of proteins of unknown function and predicted functions for thousands of these proteins, ranging from general to very specific. We also applied the model to a data set of proteins whose previously assigned, specific functions had been inferred electronically. We show that, on average, these prior function predictions are more specific (quite possibly overly specific) than predictions from our model, which is based on proteins with experimentally determined function.
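The two bit-score cutoffs quoted in the abstract split BLAST alignments into three regimes. The following is a minimal sketch of that partition only, not the authors' statistical model: the function name `similarity_regime` and the regime labels are hypothetical, and the cutoffs apply to searches against NRDB as stated above.

```python
# Bit-score cutoffs quoted in the abstract (searches against NRDB).
HIGH_BIT_SCORE = 244.7  # above this: nearly exact function similarity
LOW_BIT_SCORE = 54.6    # below this: small likelihood of a specific match


def similarity_regime(bit_score: float) -> str:
    """Label a BLAST bit score with an illustrative regime name.

    The labels are hypothetical shorthand for the three ranges
    described in the abstract.
    """
    if bit_score > HIGH_BIT_SCORE:
        return "high"
    if bit_score < LOW_BIT_SCORE:
        return "low"
    # Function similarity increases with sequence similarity here,
    # but with considerable variability.
    return "intermediate"


if __name__ == "__main__":
    for score in (300.0, 120.0, 40.0):
        print(score, similarity_regime(score))
```

In the intermediate regime the abstract reports an increasing but highly variable relationship, which is where a continuous model is needed in place of a fixed cutoff.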

Highlights

  • Three measures of function similarity were considered: 1) the Gene Ontology (GO) term depth of the common ancestral GO term of the GO terms assigned to the two proteins in a BLAST alignment, 2) the information content (IC) of the common ancestral GO term, and 3) the relative information content (RIC)

  • RIC is the ratio of the IC of the common ancestral GO term to the mean IC of the GO terms assigned to the two proteins in a BLAST alignment

  • IC has less variability and a stronger relationship with BLAST bit score than GO term depth does, and normalizing IC to obtain the RIC metric reduces the influence on the model of the variability of IC values in the training data
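The RIC ratio described in the highlights can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the text here does not say how IC is computed, so the sketch assumes the commonly used definition IC = −log2(p), where p is the fraction of annotations at or below a GO term, and the function names are hypothetical.

```python
import math


def information_content(term_probability: float) -> float:
    """IC of a GO term, ASSUMING the common definition IC = -log2(p).

    `term_probability` is the fraction of annotations that fall on the
    term or its descendants; the source does not specify this formula.
    """
    if not 0.0 < term_probability <= 1.0:
        raise ValueError("term probability must be in (0, 1]")
    return -math.log2(term_probability)


def relative_information_content(
    ic_ancestor: float, ic_term_a: float, ic_term_b: float
) -> float:
    """RIC: IC of the common ancestral GO term divided by the mean IC
    of the GO terms assigned to the two aligned proteins."""
    mean_ic = (ic_term_a + ic_term_b) / 2.0
    if mean_ic <= 0.0:
        raise ValueError("mean IC of the two terms must be positive")
    return ic_ancestor / mean_ic


if __name__ == "__main__":
    # Made-up probabilities, for illustration only.
    ic_a = information_content(1 / 16)         # term of protein A
    ic_b = information_content(1 / 64)         # term of protein B
    ic_ancestor = information_content(1 / 4)   # common ancestral term
    print(relative_information_content(ic_ancestor, ic_a, ic_b))
```

When both proteins carry the same GO term, the common ancestor is that term and the ratio is 1. A shared ancestor that is much more general than either term gives a value near 0.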

Summary

Introduction

Protein function annotation remains an important open problem in biology [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23]. The vast majority of proteins have been annotated through predictive methods that compare protein sequences and determine their degree of similarity. This comparison is carried out by computer programs such as BLAST [25,26] or by various other tools and databases [27,28,29,30,31,32,33]. A particularly important missing piece of the annotation puzzle is an in-depth understanding of the relationship between sequence similarity and function similarity over a continuous range, together with the amount of variability inherent in that relationship over all ranges of sequence similarity. Solving this puzzle requires generating a sufficiently large and diverse data set of proteins with experimentally characterized function, determining the best way to represent function for modeling purposes, and building and applying an appropriate statistical model.
