Abstract

Background: Large-scale sequencing projects have become routine laboratory practice, which has led to a new generation of tools built on function prediction methods and brought those methods back to the fore. The advent of the Gene Ontology, with its structured vocabulary and paradigm, has provided computational biologists with an appropriate means for this task.

Methodology: We present here a novel method called ARGOT (Annotation Retrieval of Gene Ontology Terms) that can quickly process thousands of sequences for functional inference. The tool exploits, for the first time, an integrated approach that combines clustering of GO terms, based on their semantic similarities, with a weighting scheme that assesses retrieved hits sharing a certain number of biological features with the sequence to be annotated. These hits may be obtained by different methods; in this work, ARGOT processing is based on BLAST results.

Conclusions: The extensive benchmark involved 10,000 protein sequences, the complete S. cerevisiae genome, and a small subset of proteins used for comparison with other available tools. The algorithm was shown to outperform existing methods and to be suitable for function prediction of single proteins owing to its high degree of sensitivity, specificity and coverage.
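The general idea of weighting GO terms by the hits that carry them can be illustrated with a minimal sketch. This is not ARGOT's actual scoring scheme, which the abstract does not specify: the weight used here (the negative base-10 logarithm of a BLAST e-value), the toy graph, the placeholder term IDs, and the function names are all assumptions made for illustration.

```python
import math
from collections import defaultdict

# Toy GO fragment (child -> parents); the term IDs are hypothetical placeholders.
PARENTS = {
    "GO:C": ["GO:B"],
    "GO:B": ["GO:A"],
    "GO:D": ["GO:A"],
    "GO:A": [],
}

def ancestors(term):
    """Return the term together with all of its ancestors in the toy DAG."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen

def score_terms(hits):
    """Accumulate a weight per GO term from a list of (e_value, [GO terms]) hits.

    Assumed weight: -log10(e-value). Each hit contributes its weight once to
    every term it is annotated with and to every ancestor of those terms.
    """
    scores = defaultdict(float)
    for e_value, terms in hits:
        # Clamp the e-value so that a reported 0.0 does not break the logarithm.
        weight = -math.log10(max(e_value, 1e-300))
        covered = set()
        for term in terms:
            covered |= ancestors(term)
        for term in covered:
            scores[term] += weight
    return dict(scores)

# Three hypothetical BLAST hits with their e-values and GO annotations.
hits = [(1e-50, ["GO:C"]), (1e-10, ["GO:D"]), (1e-5, ["GO:C", "GO:D"])]
scores = score_terms(hits)
```

In this toy setting the shared ancestor accumulates the highest score because every hit supports it, while the more specific terms are supported only by a subset of hits. A real method would need a way to trade off that specificity against the accumulated support.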

Highlights

  • The Gene Ontology uses a structured controlled vocabulary organized as a hierarchical Directed Acyclic Graph (DAG) and has two important characteristics: it has become an acknowledged and widely used framework for functional annotation, and it is designed to be exploited by computational methods [7]

  • A further check was carried out to calculate the real coverage of unique GO terms represented in the test set compared to their total in the Gene Ontology graph
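A coverage check of this kind can be sketched as the fraction of terms in the ontology that appear at least once in the test set. The excerpt does not say how the original check was computed, so the function name, the input layout, and the choice to count only directly annotated terms (without adding ancestors) are assumptions.

```python
def go_coverage(test_annotations, all_go_terms):
    """Fraction of ontology terms that occur at least once in the test set.

    test_annotations: dict mapping a protein ID to an iterable of GO term IDs.
    all_go_terms: iterable of every term ID in the ontology graph.
    """
    ontology = set(all_go_terms)
    if not ontology:
        raise ValueError("the ontology must contain at least one term")
    observed = set()
    for terms in test_annotations.values():
        observed.update(terms)
    # Ignore annotations that do not belong to the supplied ontology.
    observed &= ontology
    return len(observed) / len(ontology)

# Hypothetical test set and ontology with placeholder term IDs.
test_annotations = {"P1": ["GO:A", "GO:B"], "P2": ["GO:B", "GO:C"]}
all_go_terms = ["GO:A", "GO:B", "GO:C", "GO:D"]
coverage = go_coverage(test_annotations, all_go_terms)
```

Here three of the four ontology terms occur in the test set, so the coverage is 0.75.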

Summary

Introduction

The amount of data available in public databases has reached an unprecedented level of complexity that users can no longer manage: 789 genomes have been completed and over 1,600 are in the process of assembly (as of November 2008). The definition of protein function is itself elusive and ambiguous, as it depends on i) context: where the protein acts and how it behaves in particular conditions; ii) scale: the level at which the functional assignment is reported, namely molecular, cellular or organismal; iii) time: when and for how long a certain protein operates in the cell's life-span [3,4]. Against this background, the Gene Ontology (GO) consortium has developed a successful solution that may be considered the gold standard in functional classification [5,6]. The advent of the Gene Ontology, with its structured vocabulary and paradigm, has provided computational biologists with an appropriate means for this task.
