BindN+ for accurate prediction of DNA and RNA-binding residues from protein sequence features

Liangjiang Wang, Caiyan Huang, Jack Y Yang, Mary Qu Yang

doi:10.1186/1752-0509-4-s1-s3

Abstract

Background: Understanding how biomolecules interact is a major task of systems biology. To model protein-nucleic acid interactions, it is important to identify the DNA or RNA-binding residues in proteins. Protein sequence features, including the biochemical properties of amino acids and evolutionary information in the form of a position-specific scoring matrix (PSSM), have been used for DNA or RNA-binding site prediction. However, PSSM was designed primarily for PSI-BLAST searches, and it may not contain all the evolutionary information needed for modelling DNA or RNA-binding sites in protein sequences.

Results: In the present study, several new descriptors of evolutionary information were developed and evaluated for sequence-based prediction of DNA and RNA-binding residues using support vector machines (SVMs). The new descriptors were shown to improve classifier performance. Interestingly, the best classifiers were obtained by combining the new descriptors with PSSM, suggesting that the two capture different aspects of evolutionary information for DNA and RNA-binding site prediction. The SVM classifiers achieved 77.3% sensitivity and 79.3% specificity for prediction of DNA-binding residues, and 71.6% sensitivity and 78.7% specificity for RNA-binding site prediction.

Conclusions: Predictions at this level of accuracy may provide useful information for modelling protein-nucleic acid interactions in systems biology studies. We have thus developed a web-based tool called BindN+ (http://bioinfo.ggc.org/bindn+/) to make the SVM classifiers accessible to the research community.

Highlights

Understanding how biomolecules interact is a major task of systems biology
DNA-binding site prediction: Three biochemical features of an amino acid, the hydrophobicity index (H), side chain pKa value (K), and molecular mass (M), were previously used to construct support vector machine (SVM) classifiers for DNA or RNA-binding residues [5], and these classifiers have been used by the BindN web server
Different SVM training parameters were tested, and the optimal parameter settings were chosen based on the highest prediction strength and the highest area under the Receiver Operating Characteristic (ROC) curve (AUC)
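The evaluation measures named above (sensitivity, specificity, prediction strength, and ROC AUC) can be illustrated with a minimal sketch. The labels and scores below are made-up toy values, not results from the study, and the 0.5 decision threshold is arbitrary. Prediction strength is taken here to be the average of sensitivity and specificity, which is an assumption because this summary does not define it. The AUC is computed by the rank-based (Mann-Whitney) formulation rather than by integrating a plotted curve.

```python
def sensitivity_specificity(labels, predictions):
    """Return (sensitivity, specificity) for binary labels and predictions (1 = binding)."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = binding residue, 0 = non-binding residue.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.3, 0.8, 0.4, 0.2, 0.1, 0.05]

# Arbitrary threshold, chosen only for illustration.
predictions = [1 if s >= 0.5 else 0 for s in scores]

sens, spec = sensitivity_specificity(labels, predictions)
strength = (sens + spec) / 2  # assumed definition of prediction strength
auc = roc_auc(labels, scores)

print(sens, spec, strength, auc)
```

Under these assumptions, a parameter search would repeat this calculation for each candidate setting and keep the one with the best strength and AUC.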

Summary

Introduction

To understand the molecular mechanisms of protein-nucleic acid recognition and to model these interactions, it is important to identify the DNA or RNA-binding amino acid residues in proteins. The identification is straightforward if the structure of a protein-DNA or protein-RNA complex is known. With the rapid accumulation of sequence data, predictive methods are needed for identifying potential DNA or RNA-binding residues in protein sequences. Protein sequence features, including the biochemical properties of amino acids and evolutionary information in the form of a position-specific scoring matrix (PSSM), have been used for DNA or RNA-binding site prediction. However, PSSM was designed primarily for PSI-BLAST searches, and it may not contain all the evolutionary information needed for modelling DNA or RNA-binding sites in protein sequences.
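One common way to turn per-residue sequence features into classifier inputs is a sliding window centred on each residue. The sketch below shows that general idea only. The function name `window_features`, the zero padding at the sequence ends, the window size, and the toy feature values are all assumptions made for illustration, and none of them is taken from this summary.

```python
def window_features(per_residue, window):
    """Concatenate the feature rows of each residue and its neighbours.

    per_residue: list of equal-length feature rows, one row per residue
                 (for example biochemical values or PSSM scores).
    window:      odd number of residues per window, centred on the target.
    Positions that fall outside the sequence are padded with zeros
    (an assumed convention).
    """
    if window % 2 == 0:
        raise ValueError("window must be odd so that it has a central residue")
    half = window // 2
    pad = [0.0] * len(per_residue[0])
    encoded = []
    for i in range(len(per_residue)):
        vector = []
        for j in range(i - half, i + half + 1):
            row = per_residue[j] if 0 <= j < len(per_residue) else pad
            vector.extend(row)
        encoded.append(vector)
    return encoded


# Toy input (hypothetical values): three residues, two features each.
toy = [[1.0, 0.1], [2.0, 0.2], [3.0, 0.3]]
encoded = window_features(toy, window=3)

for vector in encoded:
    print(vector)
```

Each output vector has `window * n_features` entries, so every residue yields one fixed-length input regardless of its position in the sequence.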

Objectives

Methods

Results

Conclusion