Abstract

Domains are the structural and functional units of proteins. With the avalanche of protein sequences generated in the postgenomic age, it is highly desirable to develop effective methods for predicting protein domains from sequence information alone, so as to facilitate the structure prediction of proteins and speed up their functional annotation. However, although many efforts have been made in this regard, prediction of protein domains from sequence information remains a challenging and elusive problem. Here, a new method was developed by combining the techniques of RF (random forest), mRMR (maximum relevance minimum redundancy), and IFS (incremental feature selection), and by incorporating features of physicochemical and biochemical properties, sequence conservation, residual disorder, secondary structure, and solvent accessibility. The overall success rate achieved by the new method on an independent dataset was around 73%, which was about 28–40% higher than those achieved by existing methods on the same benchmark dataset. Furthermore, an in-depth analysis revealed that the features of evolution, codon diversity, electrostatic charge, and disorder played more important roles than the others in predicting protein domains, quite consistent with experimental observations. It is anticipated that the new method may become a high-throughput tool for annotating protein domains, or may, at the very least, play a complementary role to the existing domain prediction methods, and that the findings about the key features with high impact on domain prediction might provide useful insights or clues for further experimental investigations in this area. Finally, it has not escaped our notice that the current approach can also be utilized to study protein signal peptides, B-cell epitopes, and HIV protease cleavage sites, among many other important topics in protein science and biomedicine.

Highlights

  • Protein domains are structural, evolutionary and functional units of proteins

  • Prediction of protein domains from sequence information can facilitate the prediction of protein tertiary structure [1,2], the annotation of protein functions [2,3], protein structure determination [4], protein engineering [5], and mutagenesis [6,7]

  • The mRMR result: listed in the Online Supporting Information S2 are two kinds of outcomes obtained by running the mRMR software. One is the "MaxRel feature list", which ranks all the features according to their relevance to the class of samples; the other is the "mRMR feature list", which ranks the features according to the criteria of maximum relevance and minimum redundancy
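
The difference between the two feature lists can be illustrated with a minimal sketch. It is not the mRMR software used in the study: it assumes discrete features, estimates mutual information from simple counts, and scores each candidate as relevance minus mean redundancy with the features already selected (one common form of the mRMR criterion). The feature names and toy data are hypothetical and chosen only so that one feature is an exact copy of another.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def maxrel_ranking(features, labels):
    """Rank feature names by relevance to the class labels only."""
    return sorted(
        features,
        key=lambda f: mutual_information(features[f], labels),
        reverse=True,
    )


def mrmr_ranking(features, labels):
    """Greedy ranking: relevance minus mean redundancy with selected features."""
    relevance = {f: mutual_information(v, labels) for f, v in features.items()}
    selected, remaining = [], list(features)

    def score(f):
        if not selected:
            return relevance[f]
        redundancy = sum(
            mutual_information(features[f], features[s]) for s in selected
        ) / len(selected)
        return relevance[f] - redundancy

    while remaining:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: the class depends on two independent features;
# "conservation_copy" duplicates "conservation"; "noise" is uninformative.
toy_features = {
    "conservation":      [0, 0, 0, 0, 1, 1, 1, 1],
    "conservation_copy": [0, 0, 0, 0, 1, 1, 1, 1],
    "disorder":          [0, 0, 1, 1, 0, 0, 1, 1],
    "noise":             [0, 1, 0, 1, 0, 1, 0, 1],
}
toy_labels = [0, 0, 1, 1, 2, 2, 3, 3]

maxrel_list = maxrel_ranking(toy_features, toy_labels)
mrmr_list = mrmr_ranking(toy_features, toy_labels)
print("MaxRel:", maxrel_list)
print("mRMR:  ", mrmr_list)
```

In this toy case the relevance-only ranking places the duplicated feature second, whereas the mRMR-style ranking defers it until after the independent informative feature, which is the behaviour the minimum-redundancy criterion is meant to produce.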

Summary

Introduction

Protein domains are structural, evolutionary and functional units of proteins. Prediction of protein domains from sequence information can facilitate the prediction of protein tertiary structure [1,2], the annotation of protein functions [2,3], protein structure determination [4], protein engineering [5], and mutagenesis [6,7]. The concrete techniques involved in the ab-initio methods are machine learning algorithms [35,39], artificial neural networks [40], and support vector machines [41,42], along with high-quality domain databases such as CATH [43], SCOP [44] and DALI [45]. Since it needed to scan the entire sequence of a protein, usually involving several hundred amino acids, and relied on inputs containing weak domain information, the ab-initio method needed much more computational time and often suffered from low prediction accuracy. Let us describe how to deal with these steps.

Materials and Methods
Results and Discussion
Method
Scan the Entire Protein Sequence to Refine the Domain Region Prediction
11. Useful Insights for Guiding Experiments or Being Validated by Experiments