Better prediction of protein contact number using a support vector regression analysis of amino acid sequence

Zheng Yuan

doi:10.1186/1471-2105-6-248

Abstract

Background: Protein tertiary structure can be partly characterized via each amino acid's contact number measuring how residues are spatially arranged. The contact number of a residue in a folded protein is a measure of its exposure to the local environment, and is defined as the number of Cβ atoms in other residues within a sphere around the Cβ atom of the residue of interest. Contact number is partly conserved between protein folds and thus is useful for protein fold and structure prediction. In turn, each residue's contact number can be partially predicted from primary amino acid sequence, assisting tertiary fold analysis from sequence data. In this study, we provide a more accurate contact number prediction method from protein primary sequence.

Results: We predict contact number from protein sequence using a novel support vector regression algorithm. Using protein local sequences with multiple sequence alignments (PSI-BLAST profiles), we demonstrate a correlation coefficient between predicted and observed contact numbers of 0.70, which outperforms previously achieved accuracies. Including additional information about sequence weight and amino acid composition further improves prediction accuracies significantly with the correlation coefficient reaching 0.73. If residues are classified as being either "contacted" or "non-contacted", the prediction accuracies are all greater than 77%, regardless of the choice of classification thresholds.

Conclusion: The successful application of support vector regression to the prediction of protein contact number reported here, together with previous applications of this approach to the prediction of protein accessible surface area and B-factor profile, suggests that a support vector regression approach may be very useful for determining the structure-function relation between primary protein sequence and higher order consecutive protein structural and functional properties.
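The two accuracy measures quoted in the abstract, a correlation coefficient between predicted and observed contact numbers and a two-state ("contacted" versus "non-contacted") accuracy, can be illustrated with a minimal sketch. It rests on assumptions not stated in this text: the correlation coefficient is taken to be Pearson's, a residue is taken to be "contacted" when its contact number is at or above the threshold, and the function names `pearson_r` and `two_state_accuracy` are hypothetical.

```python
import math


def pearson_r(predicted, observed):
    """Pearson correlation between two equal-length lists of numbers.

    Assumption: the source only says "correlation coefficient";
    Pearson's r is used here for illustration.
    """
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    return cov / math.sqrt(var_p * var_o)


def two_state_accuracy(predicted, observed, threshold):
    """Fraction of residues whose predicted and observed classes agree.

    Assumption: a residue counts as "contacted" when its contact
    number is >= threshold; the source does not give the exact rule.
    """
    hits = sum(
        (p >= threshold) == (o >= threshold)
        for p, o in zip(predicted, observed)
    )
    return hits / len(observed)
```

Calling `two_state_accuracy` with several different `threshold` values on the same predictions mirrors the kind of threshold comparison the abstract describes.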

Highlights

Protein tertiary structure can be partly characterized via each amino acid's contact number measuring how residues are spatially arranged
In our former work, we studied the dependence of protein accessible surface area (ASA) [6,7] and B-factor [8] on primary sequence
We provide a new method for the prediction of protein contact number

Summary

Introduction

Protein tertiary structure can be partly characterized via each amino acid's contact number measuring how residues are spatially arranged. We provide a more accurate contact number prediction method from protein primary sequence. One protein structural feature is of particular interest here, namely, residue contact number (CN), which can be used to enhance protein fold recognition [1]. This measure has been regarded as the conserved solvent exposure descriptor of similar folds. We seek to use protein contact number to assist with the tertiary fold prediction of novel proteins, which requires determining an accurate functional relationship between a protein's primary sequence and its residues' contact numbers. As a result, we achieve more accurate predicted contact numbers than have been achieved to date.
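To make the quantity concrete, the contact number of a residue is the count of Cβ atoms from other residues that fall within a sphere centred on that residue's own Cβ atom. The following is a minimal sketch of that counting rule, not the procedure used in this study. The function name `contact_numbers` is hypothetical, the sphere radius is left as a required argument because no value is given in this text, and treating atoms exactly on the sphere boundary as inside is an assumption.

```python
import math


def contact_numbers(cb_coords, radius):
    """Return the contact number of every residue.

    cb_coords: list of (x, y, z) C-beta coordinates, one per residue.
    radius:    sphere radius in the same units as the coordinates
               (hypothetical parameter; no value is stated in the text).
    """
    counts = []
    for i, centre in enumerate(cb_coords):
        n = 0
        for j, other in enumerate(cb_coords):
            # Skip the residue itself; count C-beta atoms of other
            # residues inside (or on, by assumption) the sphere.
            if i != j and math.dist(centre, other) <= radius:
                n += 1
        counts.append(n)
    return counts


# Toy coordinates: three residues close together, one far away.
toy = [(0, 0, 0), (3, 0, 0), (0, 4, 0), (20, 0, 0)]
print(contact_numbers(toy, 6.0))  # prints [2, 2, 2, 0]
```

The double loop is quadratic in the number of residues, which keeps the sketch short; a spatial index would be a natural replacement for large structures.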

Methods

Results

Discussion

Conclusion