Abstract

Gene ontology (GO) and GO annotation are important resources for biological information management and knowledge discovery, but the speed of manual annotation became a major bottleneck of database curation. BioCreative IV GO annotation task aims to evaluate the performance of system that automatically assigns GO terms to genes based on the narrative sentences in biomedical literature. This article presents our work in this task as well as the experimental results after the competition. For the evidence sentence extraction subtask, we built a binary classifier to identify evidence sentences using reference distance estimator (RDE), a recently proposed semi-supervised learning method that learns new features from around 10 million unlabeled sentences, achieving an F1 of 19.3% in exact match and 32.5% in relaxed match. In the post-submission experiment, we obtained 22.1% and 35.7% F1 performance by incorporating bigram features in RDE learning. In both development and test sets, RDE-based method achieved over 20% relative improvement on F1 and AUC performance against classical supervised learning methods, e.g. support vector machine and logistic regression. For the GO term prediction subtask, we developed an information retrieval-based method to retrieve the GO term most relevant to each evidence sentence using a ranking function that combined cosine similarity and the frequency of GO terms in documents, and a filtering method based on high-level GO classes. The best performance of our submitted runs was 7.8% F1 and 22.2% hierarchy F1. We found that the incorporation of frequency information and hierarchy filtering substantially improved the performance. In the post-submission evaluation, we obtained a 10.6% F1 using a simpler setting. Overall, the experimental analysis showed our approaches were robust in both the two tasks.

Highlights

  • With the expansion of knowledge in biomedical domain, the curation of databases for biological entities such as genes, proteins, diseases and drugs, becomes increasingly important for information management and knowledge discovery

  • In TREC Genomics Track 2004 [4], there were two tasks: the first task was to retrieve articles for Gene ontology (GO) annotation, where the best performance was 27.9% F-score and 65.1% normalized utility obtained by a logistic regression with bag-of-words and medical subject headings (MeSH) features; the second task was to classify each article into high-level GO classes: molecular function, biological process or cellular component, with the best F-score of 56.1% using a bagof-words-based KNN classifier

  • It is promising to see that all the runs based on reference distance estimator (RDE) achieved better F1 than support vector machine (SVM) [11] and logistic regression [12], and the incorporation of RDE produced significant improvement on F1comparing the performance of Logistic regression (F1 17.4% and 28.6%) with the best run with RDE (F1 22.1% and 35.7%), which justified the success of the application of RDE to this task

Read more

Summary

Introduction

With the expansion of knowledge in biomedical domain, the curation of databases for biological entities such as genes, proteins, diseases and drugs, becomes increasingly important for information management and knowledge discovery. In TREC Genomics Track 2004 [4], there were two tasks: the first task was to retrieve articles for GO annotation, where the best performance was 27.9% F-score and 65.1% normalized utility obtained by a logistic regression with bag-of-words and MeSH features; the second task was to classify each article into high-level GO classes: molecular function, biological process or cellular component, with the best F-score of 56.1% using a bagof-words-based KNN classifier These two tasks were both simplified version of GO annotation process, since they did not assign exact GO terms to certain gene. Based on the results it is no doubt that the task was rather difficult and the state-of-the-art performance was far from the requirement of practical use

Methods
Results
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call