Abstract

Protein function prediction is a complex multiclass multilabel classification problem, characterized by multiple issues such as the incompleteness of the available annotations, the integration of multiple sources of high dimensional biomolecular data, the unbalance of several functional classes, and the difficulty of univocally determining negative examples. Moreover, the hierarchical relationships between functional classes that characterize both the Gene Ontology and FunCat taxonomies motivate the development of hierarchy-aware prediction methods that showed significantly better performances than hierarchical-unaware “flat” prediction methods. In this paper, we provide a comprehensive review of hierarchical methods for protein function prediction based on ensembles of learning machines. According to this general approach, a separate learning machine is trained to learn a specific functional term and then the resulting predictions are assembled in a “consensus” ensemble decision, taking into account the hierarchical relationships between classes. The main hierarchical ensemble methods proposed in the literature are discussed in the context of existing computational methods for protein function prediction, highlighting their characteristics, advantages, and limitations. Open problems of this exciting research area of computational biology are finally considered, outlining novel perspectives for future research.

Highlights

  • Exploiting the wealth of biomolecular data accumulated by novel high-throughput biotechnologies, “in silico” protein function prediction can generate hypotheses to drive the biological discovery and validation of protein functions [1]

  • “in vitro” methods are costly in time and money, and automatic prediction methods can support the biologist in understanding the role of a protein or of a biological process or in annotating a new genome at high level of accuracy or more, in general, in solving problems of functional genomics [2]

  • The analysis of the per level classification performances shows that true path rule (TPR)-W, by exploiting a global strategy of classification, is able to achieve a good compromise between precision and recall, enhancing the F-measure at each level of the taxonomy [31]

Read more

Summary

Introduction

Exploiting the wealth of biomolecular data accumulated by novel high-throughput biotechnologies, “in silico” protein function prediction can generate hypotheses to drive the biological discovery and validation of protein functions [1]. “in vitro” methods are costly in time and money, and automatic prediction methods can support the biologist in understanding the role of a protein or of a biological process or in annotating a new genome at high level of accuracy or more, in general, in solving problems of functional genomics [2]. Unsupervised methods can be applied to AFP, due to the inherent difficulty of extracting functional classes without exploiting any available a priori information [3, 4]; usually supervised or semisupervised learning methods are applied in order to exploit the available a priori information about gene annotations. In order to exploit all the information available for each protein, we need to learn methods that are able to integrate different data sources [7]

Methods
Results
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call