Abstract

Background: Worldwide structural genomics projects continue to release new protein structures at an unprecedented pace, so far nearly 6000, but only about 60% of these proteins have any sort of functional annotation.

Results: We explored a range of features that can be used for the prediction of functional residues given a known three-dimensional structure. These features include various centrality measures of nodes in graphs of interacting residues: closeness, betweenness and page-rank centrality. We also analyzed the distance of functional amino acids to the general center of mass (GCM) of the structure, relative solvent accessibility (RSA), and the use of relative entropy as a measure of sequence conservation. From the selected features, neural networks were trained to identify catalytic residues. We found that distance to the GCM together with amino acid type provides a good discriminant function when combined independently with sequence conservation. Using an independent test set of 29 annotated protein structures, the method returned 411 of the initial 9262 residues as the most likely to be involved in function. These 411 residues contain 70 of the 111 annotated catalytic residues. This represents an approximately 14-fold enrichment of catalytic residues relative to the entire input set (corresponding to a sensitivity of 63% and a precision of 17%), a performance competitive with that of other state-of-the-art methods.

Conclusions: We found that several of the graph-based measures utilize the same underlying feature of protein structures, which can be captured more simply and more effectively with the distance to GCM definition. This definition also has the added advantage of simplicity and easy implementation. Meanwhile, sequence conservation remains by far the most influential feature in identifying functional residues. We also found that, due to the rapid changes in size and composition of sequence databases, conservation calculations must be recalibrated for specific reference databases.
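
The performance figures quoted above can be re-derived from the four counts reported for the test set. The following is a minimal sketch, not the authors' code: the variable names are ours, and it assumes that "enrichment" means precision divided by the background fraction of catalytic residues, a definition the abstract does not state explicitly.

```python
# Counts as reported in the abstract (independent test set of 29 structures).
total_residues = 9262    # all residues in the input set
catalytic_total = 111    # annotated catalytic residues
predicted = 411          # residues returned by the method
true_positives = 70      # annotated catalytic residues among those returned

# Fraction of annotated catalytic residues that were recovered.
sensitivity = true_positives / catalytic_total
# Fraction of returned residues that are annotated as catalytic.
precision = true_positives / predicted
# Fraction of catalytic residues expected when picking residues at random.
background = catalytic_total / total_residues
# Assumption: enrichment is precision relative to the background fraction.
enrichment = precision / background

print(f"sensitivity = {sensitivity:.1%}")
print(f"precision   = {precision:.1%}")
print(f"enrichment  = {enrichment:.1f}-fold")
```

Under that assumed definition the counts give roughly 63% sensitivity, 17% precision and a 14-fold enrichment, in line with the values stated in the abstract.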

Highlights

  • For the purpose of feature selection, we explore the pairwise correlation between some of the attributes most frequently used for the prediction of functional residues, namely, the centrality measures of closeness, betweenness and page-rank, in addition to distance to the general center of mass (GCM) [35], relative solvent accessibility (RSA) and sequence conservation

  • Feature selection to predict functional residues: methods for the identification of functional residues rely on a wide variety of attributes that differentiate functional and non-functional amino acids
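
Two of the attributes named in the highlights, distance to the GCM and closeness centrality, can be illustrated on a toy structure. This is a hedged sketch under several assumptions that are ours rather than the paper's: each residue is represented by a single coordinate, the GCM is taken as the unweighted centroid of those coordinates, residues are considered to interact when they lie within a hypothetical 7.0 Å cutoff, and closeness is computed on the resulting unweighted graph. The function names are illustrative only.

```python
import math
from collections import deque

def distances_to_gcm(coords):
    """Distance of each residue to the centroid (assumed stand-in for the GCM)."""
    n = len(coords)
    gcm = tuple(sum(c[i] for c in coords) / n for i in range(3))
    return [math.dist(c, gcm) for c in coords]

def contact_graph(coords, cutoff=7.0):
    """Adjacency sets linking residues closer than `cutoff` (hypothetical value)."""
    adj = {i: set() for i in range(len(coords))}
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                adj[i].add(j)
                adj[j].add(i)
    return adj

def closeness(adj):
    """Closeness centrality: reachable nodes divided by summed path lengths."""
    scores = []
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:  # breadth-first search gives shortest paths in hops
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total = sum(dist.values())
        scores.append((len(dist) - 1) / total if total else 0.0)
    return scores

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy "structure": five residues on a line, 5 Å apart (made-up coordinates).
coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0),
          (15.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
gcm_dist = distances_to_gcm(coords)
close = closeness(contact_graph(coords))
r = pearson(gcm_dist, close)
print(gcm_dist, close, r)
```

In this toy case the residue nearest the centroid is also the one with the highest closeness, so the two attributes are strongly negatively correlated. Whether and how strongly this holds on real structures is what the pairwise correlation analysis is meant to establish.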

Summary

Introduction

Worldwide structural genomics projects continue to release new protein structures at an unprecedented pace, so far nearly 6000, but only about 60% of these proteins have any sort of functional annotation. The most frequent and basic approach to characterizing protein function is to transfer functional annotation between proteins based on sequence similarity [4], typically after searching sequence databases with tools like BLAST [5] or other sensitive, profile-based search approaches [6,7]. Given that the average sequence identity between structurally related proteins is ~8-9%, and that most of these share less than 15% identity [13], we must expect a high degree of functional diversity in proteins with similar folds [14]. This indicates an imperative need for structures and structure-based approaches for the functional annotation of these proteins.
