Prediction of catalytic residues using Support Vector Machine with selected protein sequence and structural properties.

Natalia V Petrova, Cathy H Wu

doi:10.1186/1471-2105-7-312

Abstract

Background: The number of protein sequences deriving from genome sequencing projects is outpacing our knowledge about the function of these proteins. With the gap between experimentally characterized and uncharacterized proteins continuing to widen, it is necessary to develop new computational methods and tools for functional prediction. Knowledge of catalytic sites provides a valuable insight into protein function. Although many computational methods have been developed to predict catalytic residues and active sites, their accuracy remains low, with a significant number of false positives. In this paper, we present a novel method for the prediction of catalytic sites, using a carefully selected, supervised machine learning algorithm coupled with an optimal discriminative set of protein sequence conservation and structural properties.

Results: To determine the best machine learning algorithm, 26 classifiers in the WEKA software package were compared using a benchmarking dataset of 79 enzymes with 254 catalytic residues in a 10-fold cross-validation analysis. Each residue of the dataset was represented by a set of 24 residue properties previously shown to be of functional relevance, as well as a label {+1/-1} to indicate catalytic/non-catalytic residue. The best-performing algorithm was the Sequential Minimal Optimization (SMO) algorithm, which is a Support Vector Machine (SVM). The Wrapper Subset Selection algorithm further selected seven of the 24 attributes as an optimal subset of residue properties, with sequence conservation, catalytic propensities of amino acids, and relative position on protein surface being the most important features.

Conclusion: The SMO algorithm with 7 selected attributes correctly predicted 228 of the 254 catalytic residues, with an overall predictive accuracy of more than 86%. Missing only 10.2% of the catalytic residues, the method captures the fundamental features of catalytic residues and can be used as a "catalytic residue filter" to facilitate experimental identification of catalytic residues for proteins with known structure but unknown function.
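The figures quoted in the conclusion can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not the authors' code: it recomputes the miss rate from the 228 of 254 correctly predicted catalytic residues, and it defines the Matthews correlation coefficient (MCC) from confusion-matrix counts. The function names `sensitivity` and `mcc` are assumptions introduced here. The abstract does not report true-negative or false-positive counts, so no MCC value is computed for the paper's results.

```python
import math


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true catalytic residues that were predicted as catalytic."""
    return tp / (tp + fn)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when any marginal total is zero (the usual convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Counts stated in the abstract: 228 of 254 catalytic residues were found.
total_catalytic = 254
true_positives = 228
false_negatives = total_catalytic - true_positives  # residues that were missed

sens = sensitivity(true_positives, false_negatives)
miss_rate = 1.0 - sens

print(f"missed residues: {false_negatives}")
print(f"sensitivity:     {sens:.1%}")
print(f"miss rate:       {miss_rate:.1%}")
```

The miss rate printed here matches the 10.2% stated in the conclusion. The "more than 86%" overall accuracy also depends on how non-catalytic residues were classified, which the abstract does not break down.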

Highlights

The number of protein sequences deriving from genome sequencing projects is outpacing our knowledge about the function of these proteins
The performance of the algorithms was measured by the Matthews correlation coefficient (MCC) in a 10-fold cross-validation analysis using three balanced datasets generated from the benchmarking data, each with an equal number of non-catalytic residues randomly chosen from all non-catalytic residues of the benchmarking dataset
The analysis of the optimal subset selected from the initial
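The balanced-dataset construction mentioned in the highlights can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the helper name `make_balanced_datasets`, the residue identifiers, the random seed, and the size of the non-catalytic pool (5000) are all hypothetical. Only the 254 catalytic residues, the three datasets, and the {+1/-1} labels come from the text.

```python
import random


def make_balanced_datasets(catalytic, non_catalytic, n_datasets=3, seed=0):
    """Build balanced datasets of (residue, label) pairs.

    Each dataset keeps every catalytic residue (label +1) and adds an equal
    number of non-catalytic residues (label -1) drawn at random, without
    replacement, from the full non-catalytic pool.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    datasets = []
    for _ in range(n_datasets):
        negatives = rng.sample(non_catalytic, len(catalytic))
        examples = [(r, +1) for r in catalytic] + [(r, -1) for r in negatives]
        rng.shuffle(examples)
        datasets.append(examples)
    return datasets


# Hypothetical residue identifiers; the pool size of 5000 is an assumption.
catalytic = [f"cat_{i}" for i in range(254)]
non_catalytic = [f"non_{i}" for i in range(5000)]

datasets = make_balanced_datasets(catalytic, non_catalytic)
for i, ds in enumerate(datasets, start=1):
    positives = sum(1 for _, label in ds if label == +1)
    print(f"dataset {i}: {len(ds)} residues, {positives} catalytic")
```

Each dataset draws its negatives independently, so the three datasets share all positives but generally differ in their negatives. This lets a reader see how much a classifier's MCC varies with the choice of non-catalytic sample.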

Summary

Introduction

The number of protein sequences deriving from genome sequencing projects is outpacing our knowledge about the function of these proteins. Knowledge of catalytic sites provides a valuable insight into protein function. Although many computational methods have been developed to predict catalytic residues and active sites, their accuracy remains low, with a significant number of false positives. We present a novel method for the prediction of catalytic sites, using a carefully selected, supervised machine learning algorithm coupled with an optimal discriminative set of protein sequence conservation and structural properties. High-throughput genome projects have resulted in a rapid accumulation of predicted protein sequences for a large number of organisms. Knowledge of the location of catalytic residues provides valuable insight into the mechanisms of enzyme-catalyzed reactions.

Objectives

Methods

Results

Conclusion