Abstract

Background: The Gene Ontology is a controlled vocabulary for representing knowledge related to genes and proteins in a computable form. The current effort of manually annotating proteins with the Gene Ontology is outpaced by the rate of accumulation of biomedical knowledge in literature, which urges the development of text mining approaches to facilitate the process by automatically extracting the Gene Ontology annotation from literature. The task is usually cast as a text classification problem, and contemporary methods are confronted with unbalanced training data and the difficulties associated with multi-label classification.

Results: In this research, we investigated the methods of enhancing automatic multi-label classification of biomedical literature by utilizing the structure of the Gene Ontology graph. We have studied three graph-based multi-label classification algorithms, including a novel stochastic algorithm and two top-down hierarchical classification methods for multi-label literature classification. We systematically evaluated and compared these graph-based classification algorithms to a conventional flat multi-label algorithm. The results indicate that, through utilizing the information from the structure of the Gene Ontology graph, the graph-based multi-label classification methods can significantly improve predictions of the Gene Ontology terms implied by the analyzed text. Furthermore, the graph-based multi-label classifiers are capable of suggesting Gene Ontology annotations (to curators) that are closely related to the true annotations even if they fail to predict the true ones directly. A software package implementing the studied algorithms is available for the research community.

Conclusion: Through utilizing the information from the structure of the Gene Ontology graph, the graph-based multi-label classification methods have better potential than the conventional flat multi-label classification approach to facilitate protein annotation based on the literature.

Highlights

  • We constructed a PubMed-augmented GO graph using the Biological Process branch of the GO combined with the Gene Ontology Annotation (GOA) [18] corpus

  • A node represents a GO term, an edge represents the semantic relationship between a pair of GO terms, and the structure of the graph follows the definition of the Biological Process ontology from the Gene Ontology Consortium
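
The graph described in the highlights above can be illustrated with a minimal sketch. It assumes a toy directed acyclic graph stored as a child-to-parents mapping. The term identifiers (`root`, `A`, `B`, `C`, `D`) and the helper names `ancestors` and `graph_distance` are hypothetical and are not taken from the paper or its software package. The edge-count distance is only one possible way to express how "closely related" a predicted term is to a true annotation; it is not necessarily the measure the authors used.

```python
from collections import deque

# Toy graph: each term maps to its parent terms.
# Identifiers are hypothetical placeholders, not real GO accessions.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A"],
    "D": ["A", "B"],  # a term may have more than one parent (DAG, not a tree)
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable from `term` by following parent edges."""
    seen = set()
    queue = deque(parents[term])
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            queue.extend(parents[node])
    return seen


def graph_distance(a, b, parents=PARENTS):
    """Return the fewest edges between two terms, ignoring edge direction.

    Returns None if the terms are not connected.
    """
    # Build an undirected adjacency map from the child -> parents mapping.
    neighbours = {term: set(ps) for term, ps in parents.items()}
    for term, ps in parents.items():
        for p in ps:
            neighbours[p].add(term)

    # Breadth-first search from `a` until `b` is reached.
    dist = {a: 0}
    queue = deque([a])
    while queue:
        node = queue.popleft()
        if node == b:
            return dist[node]
        for n in neighbours[node]:
            if n not in dist:
                dist[n] = dist[node] + 1
                queue.append(n)
    return None
```

Under these assumptions, a classifier that predicts `C` when the true annotation is `D` is two edges away from the correct term (via their shared parent `A`). A flat classifier has no notion of such proximity, because it treats every label as unrelated to the others.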

Introduction

The Gene Ontology (GO) [1] is a controlled vocabulary used to represent molecular biology concepts and is the de facto standard for annotating genes/proteins. The process of extracting biological concepts from biomedical literature to annotate genes/proteins is performed manually by domain experts, whose roles are indispensable to ensure the accuracy of the acquired knowledge. Similar tasks were investigated in the genomic track of the Text REtrieval Conference (TREC) [4].
