Abstract

Background: We participated in the BioCreAtIvE Task 2, which addressed the annotation of proteins into the Gene Ontology (GO) based on the text of a given document and the selection of evidence text from the document justifying that annotation. We approached the task utilizing several combinations of two distinct methods: an unsupervised algorithm for expanding words associated with GO nodes, and an annotation methodology which treats annotation as categorization of terms from a protein's document neighborhood into the GO.

Results: The evaluation results indicate that the method for expanding words associated with GO nodes is quite powerful; we were able to successfully select appropriate evidence text for a given annotation in 38% of Task 2.1 queries by building on this method. The term categorization methodology achieved a precision of 16% for annotation within the correct extended family in Task 2.2, though we show through subsequent analysis that this can be improved with a different parameter setting. Our architecture proved not to be very successful on the evidence text component of the task, in the configuration used to generate the submitted results.

Conclusion: The initial results show promise for both of the methods we explored, and we are planning to integrate the methods more closely to achieve better results overall.

Highlights

  • Results were evaluated by professional annotators from the European Bioinformatics Institute (EBI) by considering the evidence text according to two criteria – whether the evidence text included a reference to the correct protein, and whether the evidence text directly referenced the GO node returned as the annotation

  • There is still significant room for improvement on this task. This is evidence of the complexities of automatic annotation of GO nodes to proteins based on a single document, where complexities arise both from the structure of the GO itself and the difficulties of annotating into a large and extremely hierarchical structure, and from the ambiguous nature of text

Summary

Introduction

We addressed BioCreAtIvE Task 2, the problem of annotating a protein with a node in the Gene Ontology (GO, http://www.geneontology.org) [1] based on the text of a given document, and of selecting evidence text that justifies the predicted annotation. We approached the task using several combinations of two distinct methods: an unsupervised algorithm for expanding words associated with GO nodes, and an annotation methodology which treats annotation as categorization of terms from a protein's document neighborhood into the GO. The second method categorizes terms derived from the sentential neighborhoods of the given protein in the given document into nodes in the GO. The system incorporates Natural Language Processing (NLP) components such as a morphological ...
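To make the idea of categorizing terms from a protein's sentential neighborhood concrete, here is a minimal sketch. It is not the system described in this paper: the word sets, node labels, scoring by simple term counts, the example text, and all function names are assumptions chosen for illustration. The node labels are placeholders, not real GO identifiers.

```python
import re
from collections import Counter

# Hypothetical word sets for two placeholder nodes (not real GO identifiers).
NODE_WORDS = {
    "phosphorylation_node": {"phosphorylation", "kinase", "phosphorylates"},
    "transport_node": {"transport", "transporter", "membrane"},
}

# Hypothetical document text with made-up protein names.
DOC = (
    "Abc1 is a kinase that phosphorylates Xyz2. "
    "The membrane transporter Xyz2 was also studied. "
    "Abc1 kinase activity increased."
)


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase alphabetic tokens only (digits and punctuation are dropped)."""
    return re.findall(r"[a-z]+", sentence.lower())


def neighborhood_terms(text, protein):
    """Count tokens in every sentence that mentions the protein name."""
    counts = Counter()
    for sentence in split_sentences(text):
        if protein.lower() in sentence.lower():
            counts.update(tokenize(sentence))
    return counts


def rank_nodes(text, protein, node_words):
    """Score each node by how often its words occur in the neighborhood."""
    terms = neighborhood_terms(text, protein)
    scores = {node: sum(terms[w] for w in words)
              for node, words in node_words.items()}
    hits = [(node, score) for node, score in scores.items() if score > 0]
    # Highest score first; ties broken alphabetically by node label.
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))


def evidence_sentence(text, protein, words):
    """Pick the protein-mentioning sentence with the most node-word hits."""
    best, best_score = None, 0
    for sentence in split_sentences(text):
        if protein.lower() not in sentence.lower():
            continue
        score = sum(1 for token in tokenize(sentence) if token in words)
        if score > best_score:
            best, best_score = sentence, score
    return best
```

Calling `rank_nodes(DOC, "Abc1", NODE_WORDS)` ranks the placeholder nodes for the protein, and `evidence_sentence` then picks a candidate justification sentence for the top-ranked node. A real system would also need to handle the GO hierarchy and ambiguous protein mentions, which this sketch ignores.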
