Abstract

The GO-Cellular Component (GO-CC) ontology provides a controlled vocabulary for the consistent description of the subcellular compartments or macromolecular complexes where proteins may act. Current machine learning-based methods for the automated GO-CC annotation of proteins suffer from inconsistency among individual GO-CC term predictions. Here, we present FGGA-CC+, a class of hierarchical graph-based classifiers for the consistent GO-CC annotation of protein-coding genes at the subcellular compartment or macromolecular complex levels. To boost the accuracy of GO-CC predictions, we make use of the protein localization knowledge in the GO-Biological Process (GO-BP) annotations. As a result, FGGA-CC+ classifiers are built from annotation data in both the GO-CC and GO-BP ontologies. Due to their graph-based design, FGGA-CC+ classifiers are fully interpretable and their predictions are amenable to expert analysis. Promising results were obtained on protein annotation data from five model organisms. Additionally, FGGA-CC+ classifiers were successfully validated on the annotation of a challenging subset of tandem duplicated genes in tomato, a non-model organism. Overall, these results suggest that FGGA-CC+ classifiers can indeed be useful for satisfying the huge demand for GO-CC annotation arising from ubiquitous high-throughput sequencing and proteomic projects.

Highlights

  • The GO-Cellular Component (GO-CC) ontology provides a controlled vocabulary for the consistent description of the subcellular compartments or macromolecular complexes where proteins may act

  • FGGA-CC+ classifiers were evaluated on protein sequences from five model organisms, D. rerio, A. thaliana, S. cerevisiae, D. melanogaster and M. musculus, using a 5-fold cross-validation approach

  • A first insight into the benefits of requiring consistent GO-CC predictions can be appreciated in Fig. 1, where FGGA-CC+ processing over flat GO-CC predictions promotes consistency and reduces the number of false positives


Summary

Introduction

The GO-Cellular Component (GO-CC) ontology provides a controlled vocabulary for the consistent description of the subcellular compartments or macromolecular complexes where proteins may act. A combination of chemical crosslinking[13], mass spectrometry, and cryo-electron microscopy[14] methods can be used to accurately determine the structure and function of macromolecular complexes. Although these advanced experimental methods are beginning to bear fruit[15,16], their time-consuming nature and elevated costs[17,18] make them incompatible with current GO-CC protein annotation demands from ubiquitous large-scale sequencing and proteomic projects. In this scenario, in-silico methods for the automated GO-CC annotation of proteins, i.e., for predicting their localization at the subcellular structure or macromolecular complex levels, become promising alternatives[19,20,21,22]. False positive predictions will always be propagated to the root instead of attempting the prediction of less specific but easier terms, which could improve overall prediction accuracy.
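To make the notion of consistent GO-CC predictions concrete, the following minimal Python sketch shows what an inconsistency among flat per-term scores looks like and one simple way to repair it. It is not the FGGA-CC+ method, which relies on graph-based inference; it is a basic score-capping heuristic used here for illustration only. The toy hierarchy, the scores, the 0.5 threshold, and the function names are all assumptions.

```python
from typing import Dict, List

# Hypothetical toy fragment of a GO-CC-like hierarchy (child -> parents).
# The real ontology is far larger; this is illustrative only.
PARENTS: Dict[str, List[str]] = {
    "cellular_component": [],
    "organelle": ["cellular_component"],
    "nucleus": ["organelle"],
    "mitochondrion": ["organelle"],
}


def inconsistent_terms(scores: Dict[str, float],
                       parents: Dict[str, List[str]],
                       threshold: float = 0.5) -> List[str]:
    """Return terms predicted positive while at least one parent is not."""
    return [
        term for term in scores
        if scores[term] >= threshold
        and any(scores[p] < threshold for p in parents[term])
    ]


def enforce_consistency(scores: Dict[str, float],
                        parents: Dict[str, List[str]]) -> Dict[str, float]:
    """Cap each term's score by the lowest (already capped) parent score."""
    fixed: Dict[str, float] = {}

    def resolve(term: str) -> float:
        if term not in fixed:
            # The root has no parents, so its cap defaults to 1.0.
            cap = min((resolve(p) for p in parents[term]), default=1.0)
            fixed[term] = min(scores[term], cap)
        return fixed[term]

    for term in scores:
        resolve(term)
    return fixed


# Hypothetical flat (per-term, independent) classifier scores.
flat_scores = {
    "cellular_component": 0.9,
    "organelle": 0.4,
    "nucleus": 0.8,        # positive although its parent is negative
    "mitochondrion": 0.2,
}

consistent_scores = enforce_consistency(flat_scores, PARENTS)
```

With these assumed scores, `inconsistent_terms(flat_scores, PARENTS)` flags `nucleus`, and no term is flagged after capping. Capping resolves a conflict by lowering the child's score, so it can only remove positive predictions; a hierarchical classifier may instead weigh the evidence from both parent and child terms.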

