Abstract

A new monotonicity-constrained maximum likelihood approach, called Partial Order Optimum Likelihood (POOL), is presented and applied to the problem of functional site prediction in protein 3D structures, an important current challenge in genomics. The input consists of electrostatic and geometric properties derived from the 3D structure of the query protein alone. Sequence-based conservation information, where available, may also be incorporated. Electrostatics features from THEMATICS are combined with multidimensional isotonic regression to form maximum likelihood estimates of probabilities that specific residues belong to an active site. This allows likelihood ranking of all ionizable residues in a given protein based on THEMATICS features. The corresponding ROC curves and statistical significance tests demonstrate that this method outperforms prior THEMATICS-based methods, which in turn have been shown previously to outperform other 3D-structure-based methods for identifying active site residues. Then it is shown that the addition of one simple geometric property, the size rank of the cleft in which a given residue is contained, yields improved performance. Extension of the method to include predictions of non-ionizable residues is achieved through the introduction of environment variables. This extension results in even better performance than THEMATICS alone and constitutes to date the best functional site predictor based on 3D structure only, achieving nearly the same level of performance as methods that use both 3D structure and sequence alignment data. Finally, the method also easily incorporates such sequence alignment data, and when this information is included, the resulting method is shown to outperform the best current methods using any combination of sequence alignments and 3D structures. 
Included is an analysis demonstrating that when THEMATICS features, cleft size rank, and alignment-based conservation scores are used individually or in combination, THEMATICS features represent the single most important component of such classifiers.
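To make the idea of a monotonicity-constrained maximum likelihood estimate concrete, the following is a minimal sketch of the one-dimensional case only, using the standard pool-adjacent-violators algorithm. It is not the POOL implementation, which works over a partial order on several features. The function name `pava`, the toy labels, and the assumption that residues are already sorted by a single increasing feature score are all illustrative and do not come from the paper. For binary labels, the non-decreasing least-squares fit computed here coincides with the maximum likelihood estimate of the class probabilities under the same ordering constraint.

```python
def pava(labels, weights=None):
    """Pool-adjacent-violators fit: non-decreasing probability estimates.

    `labels` are 0/1 outcomes (1 = active-site residue in this toy setup),
    assumed to be pre-sorted by increasing value of a single feature.
    This is a hypothetical 1D illustration, not the multidimensional method.
    """
    n = len(labels)
    if weights is None:
        weights = [1.0] * n

    # Each block holds [weighted mean, total weight, number of points].
    blocks = []
    for y, w in zip(labels, weights):
        blocks.append([float(y), float(w), 1])
        # Merge backwards while the monotonicity constraint is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, c1 + c2])

    # Expand the pooled blocks back to one estimate per input point.
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted


# Toy example: six residues sorted by an increasing (made-up) feature score.
toy_labels = [0, 0, 1, 0, 1, 1]
toy_probs = pava(toy_labels)
print(toy_probs)
```

In this toy run the third and fourth residues violate the ordering (a 1 followed by a 0), so they are pooled and both receive the estimate 0.5. Sorting residues by the fitted values then gives a likelihood ranking in the spirit described above.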

Highlights

  • Development of function prediction capabilities is a major challenge in genomics

  • As described in more detail in the Materials and Methods section, the results presented in this paper are based on two sets of proteins, a set of 64 test proteins selected randomly from the Catalytic Site Atlas (CSA) database [16,17] and a 160-protein set covering most of the original CSA database

  • We presented the application of the Partial Order Optimum Likelihood (POOL) method using THEMATICS plus some other features for protein active site prediction


Summary

Introduction

Development of function prediction capabilities is a major challenge in genomics. Structural genomics projects are determining the 3D structures of expressed proteins on a high-throughput basis. The determination of function from 3D structure has proved to be a challenging task; the functions of most of these structural genomics proteins remain unknown. Computationally based predictive methods can help to guide and accelerate functional annotation. The first step toward predicting the function of a protein from its 3D structure is to determine its local site of interaction, where catalysis and/or ligand recognition occurs. Such capabilities have many important practical implications for biology and medicine.

