Abstract

The rapid increase in the number of proteins in sequence databases and the diversity of their functions challenge computational approaches for automated function prediction. Here, we introduce DeepFRI, a Graph Convolutional Network for predicting protein functions by leveraging sequence features extracted from a protein language model and protein structures. It outperforms current leading methods and sequence-based Convolutional Neural Networks and scales to the size of current sequence repositories. Augmenting the training set of experimental structures with homology models allows us to significantly expand the number of predictable functions. DeepFRI has significant de-noising capability, with only a minor drop in performance when experimental structures are replaced by protein models. Class activation mapping allows function predictions at an unprecedented resolution, enabling automated site-specific annotations at the residue level. We show the utility and high performance of our method by annotating structures from the PDB and SWISS-MODEL, making several new confident function predictions. DeepFRI is available as a webserver at https://beta.deepfri.flatironinstitute.org/.

Highlights

  • The rapid increase in the number of proteins in sequence databases and the diversity of their functions challenge computational approaches for automated function prediction

  • We evaluate function prediction performance by two measures commonly used in the CAFA challenges[27]: (1) the protein-centric maximum F-score (Fmax), which measures the accuracy of assigning Gene Ontology (GO) terms/Enzyme Commission (EC) numbers to a protein and is computed as the harmonic mean of precision and recall; and (2) the term-centric area under the precision-recall curve (AUPR), which measures the accuracy of assigning proteins to different GO terms/EC numbers

  • Our method, Deep Functional Residue Identification (DeepFRI), is trained on protein structures from the Protein Data Bank (PDB) and SWISS-MODEL. It rapidly predicts both GO terms and EC numbers of proteins and improves over state-of-the-art sequence-based methods on the majority of function terms
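
The two evaluation measures named in the highlights can be illustrated with a minimal NumPy sketch. This is not the authors' evaluation code. It assumes binary label and score matrices of shape (proteins × terms), a fixed grid of 101 thresholds for Fmax, and an unweighted (macro) average of per-term average precision as the AUPR estimate; the function names `protein_centric_fmax` and `term_centric_aupr` are hypothetical.

```python
import numpy as np


def protein_centric_fmax(y_true, y_score, thresholds=None):
    """Maximum over thresholds of the harmonic mean of protein-averaged
    precision and recall (CAFA-style Fmax; threshold grid is an assumption)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    if thresholds is None:
        thresholds = np.linspace(0.0, 1.0, 101)

    n_true = y_true.sum(axis=1)          # annotated terms per protein
    has_annotation = n_true > 0
    best = 0.0
    for t in thresholds:
        pred = y_score >= t
        n_pred = pred.sum(axis=1)        # predicted terms per protein
        tp = (pred & y_true).sum(axis=1)
        covered = n_pred > 0             # proteins with at least one prediction
        if not covered.any() or not has_annotation.any():
            continue
        # Precision is averaged over proteins with predictions at this
        # threshold; recall is averaged over all annotated proteins.
        precision = np.mean(tp[covered] / n_pred[covered])
        recall = np.mean(tp[has_annotation] / n_true[has_annotation])
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


def term_centric_aupr(y_true, y_score):
    """Macro-average over terms of average precision, used here as a simple
    estimate of the area under each term's precision-recall curve.
    Tied scores are ranked in input order (an assumption of this sketch)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    areas = []
    for j in range(y_true.shape[1]):
        labels = y_true[:, j]
        if not labels.any():             # skip terms with no positive protein
            continue
        order = np.argsort(-y_score[:, j], kind="stable")
        hits = labels[order]
        precision_at_k = np.cumsum(hits) / np.arange(1, hits.size + 1)
        areas.append(precision_at_k[hits].mean())
    return float(np.mean(areas)) if areas else float("nan")


# Toy data: 3 proteins, 2 function terms (values are illustrative only).
labels = np.array([[1, 0], [0, 1], [1, 1]])
scores = np.array([[0.9, 0.2], [0.1, 0.8], [0.7, 0.6]])
fmax = protein_centric_fmax(labels, scores)
aupr = term_centric_aupr(labels, scores)
```

Fmax scores each protein's predicted term set, whereas AUPR scores each term's ranking of proteins, so the two measures can disagree when a model performs well on frequent terms but poorly on rare ones.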

Summary

Introduction

The rapid increase in the number of proteins in sequence databases and the diversity of their functions challenge computational approaches for automated function prediction. We introduce DeepFRI, a Graph Convolutional Network for predicting protein functions by leveraging sequence features extracted from a protein language model and protein structures. It outperforms current leading methods and sequence-based Convolutional Neural Networks and scales to the size of current sequence repositories. Traditional machine learning classifiers, such as support vector machines, random forests, and logistic regression, have been used extensively for protein function prediction. These studies have established that integrative prediction schemes outperform homology-based function transfer[25,26] and that integrating multiple gene- and protein-network features typically outperforms sequence-based features, even though network features are often incomplete or unavailable. Systematic blind prediction challenges, such as the Critical Assessment of Functional Annotation (CAFA1[27], CAFA2[28], and CAFA3[29]) and MouseFunc[30], have been critical to the development of these methods and have shown that integrative machine learning and statistical methods outperform traditional sequence alignment-based methods (e.g., BLAST)[26].

