Abstract

Recognition of protein structural fold is the starting point for many structure prediction tools and protein function inference. Fold prediction is computationally demanding, and recognizing novel folds is difficult, so the majority of proteins have not been annotated with a fold classification. Here we describe a new machine learning approach using a novel feature space that can be used for accurate recognition of all 1,221 currently known folds and for inference of unknown novel folds. We show that our method achieves better than 94% accuracy even when many folds have only one training example. We demonstrate the utility of this method by predicting the folds of 34,330 human protein domains and showing that these predictions can yield useful insights into potential biological function, such as prediction of RNA-binding ability. Our method can be applied to de novo fold prediction of entire proteomes and to the identification of candidate novel fold families.

Highlights

  • Recognition of protein structural fold is the starting point for many structure prediction tools and protein function inference

  • Although protein sequences can theoretically form a vast range of structures, the number of distinct three-dimensional topologies (“folds”) observed in nature appears to be both finite and relatively small[1]: 1,221 folds are currently recognized in the SCOPe (Structural Classification of Proteins—extended) database[2], and the rate of new fold discoveries has diminished greatly over the past two decades

  • We show the utility of this feature space in conjunction with both support vector machine (SVM) and first-nearest neighbor (1NN) classifiers, and further develop our 1NN classifier into a full-scale fold recognition pipeline that can predict all currently known folds
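
As a rough illustration of the first-nearest neighbor (1NN) idea mentioned above, the sketch below assigns a query protein the fold label of its closest training vector. It is a minimal sketch, not the authors' pipeline: the Euclidean distance, the function names (`euclidean`, `predict_1nn`), the fold labels, and the three-element score vectors are all assumptions made for illustration. It does show why a 1NN classifier can work when a fold has only one training example, since a single labeled vector per fold is enough to make a prediction.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def predict_1nn(query, training):
    """Return (label, distance) of the training vector closest to `query`.

    `training` is a list of (fold_label, feature_vector) pairs.
    """
    best_label, best_dist = None, math.inf
    for label, vec in training:
        d = euclidean(query, vec)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label, best_dist

# Hypothetical data: one training example per fold, each vector holding
# scores against three structure templates (values are made up).
training = [
    ("fold_A", [0.9, 0.1, 0.2]),
    ("fold_B", [0.2, 0.8, 0.1]),
    ("fold_C", [0.1, 0.2, 0.9]),
]

label, dist = predict_1nn([0.8, 0.2, 0.3], training)
print(label, round(dist, 3))
```

The returned distance is kept alongside the label because a large nearest-neighbor distance could, under an assumed threshold, be one way to flag a query that resembles no known fold.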


Summary

Introduction

Recognition of protein structural fold is the starting point for many structure prediction tools and protein function inference. We describe a new machine learning approach using a novel feature space that can be used for accurate recognition of all 1,221 currently known folds and inference of unknown novel folds. Machine learning-based methods have been used, and can be designed either to recognize pairs of proteins with the same fold[9,10] or to classify a protein into a fold[11,12]. Although these methods have shown promising results for a subset of folds, they have so far not been able to generalize to the full-scale fold recognition problem. At the core of our method is a novel feature space constructed by threading protein sequences against a relatively small set of structure templates. The structure and function annotations of the entire human proteome are provided as a resource for the community.
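
To make the feature-space idea concrete, the sketch below turns one sequence into a vector with one score per structure template. It is only a sketch under stated assumptions: `toy_score` is a hypothetical stand-in that compares hydrophobic/polar patterns and is not a real threading score, and the template strings, the `HYDROPHOBIC` residue set, and the function names are invented for illustration. The only point carried over from the text is the shape of the representation: each protein is described by its scores against a fixed, relatively small set of templates.

```python
# Residues treated as hydrophobic in this toy example (an assumption).
HYDROPHOBIC = set("AVILMFWC")

def toy_score(sequence, template):
    """Stand-in for a threading score (NOT a real threading method).

    Returns the fraction of aligned positions whose hydrophobic (H) or
    polar (P) class matches the template pattern.
    """
    n = min(len(sequence), len(template))
    if n == 0:
        return 0.0
    matches = sum(
        ("H" if sequence[i] in HYDROPHOBIC else "P") == template[i]
        for i in range(n)
    )
    return matches / n

def feature_vector(sequence, templates, score=toy_score):
    """Score one sequence against every template, in template order."""
    return [score(sequence, t) for t in templates]

# Hypothetical templates written as hydrophobic/polar patterns.
templates = ["HPHP", "HHPP", "PPPP"]
vec = feature_vector("AKVE", templates)
print(vec)
```

Because the `score` argument is a plain callable, a real threading score could be swapped in without changing `feature_vector`, and the resulting vectors could then be passed to any standard classifier.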

