Abstract

Background: Gene Ontology (GO) terms represent the standard for annotation and representation of molecular functions, biological processes and cellular compartments, but a large gap exists between the way concepts are represented in the ontology and how they are expressed in natural language text. The construction of highly specific GO terms is formulaic, consisting of parts and pieces from simpler terms.

Results: We present two different types of manually generated rules to help capture the variation in how GO terms can appear in natural language text. The first set of rules takes into account the compositional nature of GO and recursively decomposes the terms into their smallest constituent parts. The second set of rules generates derivational variations of these smaller terms and compositionally combines all generated variants to form the original term. By applying both types of rules, new synonyms are generated for two-thirds of all GO terms, and F-measure for recognition of GO concepts on the CRAFT corpus increases from 0.498 to 0.636. Additionally, we evaluated the combination of both types of rules over one million full-text documents from Elsevier; manual validation and error analysis, based on random sampling of annotations, show that we are able to recognize GO concepts with reasonable accuracy (88%).

Conclusions: In this work we present a set of simple synonym generation rules that utilize the highly compositional and formulaic nature of Gene Ontology concepts. We illustrate how the generated synonyms aid in improving recognition of GO concepts on two different biomedical corpora. We discuss other applications of our rules for GO ontology quality assurance, explore the issue of overgeneration, and provide examples of how similar methodologies could be applied to other biomedical terminologies. Additionally, we provide all generated synonyms for use by the text-mining community.

Electronic supplementary material: The online version of this article (doi:10.1186/s13326-016-0096-7) contains supplementary material, which is available to authorized users.
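
To make the two rule types concrete, the following is a minimal Python sketch of the general idea: a term is recursively decomposed into constituent parts, variants are looked up for each part, and the variants are recombined into candidate synonyms. The prefix list, the variant table, and the function names `decompose` and `generate_synonyms` are hypothetical illustrations chosen for this example; they are not the authors' actual rules, which are manually curated and considerably richer.

```python
from itertools import product

# Hypothetical compositional prefixes (assumption: a tiny stand-in for
# the manually written decomposition rules described in the abstract).
PREFIXES = [
    "positive regulation of",
    "negative regulation of",
    "regulation of",
]

# Hypothetical variant table for the smallest parts (assumption:
# illustrative entries only, not taken from the paper's rule set).
VARIANTS = {
    "negative regulation of": ["negative regulation of", "inhibition of"],
    "apoptotic process": ["apoptotic process", "apoptosis"],
}


def decompose(term):
    """Recursively split a term into known prefixes plus a remaining core."""
    for prefix in PREFIXES:
        if term.startswith(prefix + " "):
            rest = term[len(prefix) + 1:]
            return [prefix] + decompose(rest)
    return [term]


def generate_synonyms(term):
    """Combine every variant of every part back into full-term candidates."""
    parts = decompose(term)
    # A part with no known variants contributes only itself.
    options = [VARIANTS.get(part, [part]) for part in parts]
    return {" ".join(combo) for combo in product(*options)}


if __name__ == "__main__":
    for synonym in sorted(generate_synonyms("negative regulation of apoptotic process")):
        print(synonym)
```

Because every variant of every part is combined with every other, the number of candidates grows multiplicatively with term depth, which is one way to see why overgeneration needs to be examined.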

Highlights

  • Gene Ontology (GO) terms represent the standard for annotation and representation of molecular functions, biological processes and cellular compartments, but a large gap exists between the way concepts are represented in the ontology and how they are expressed in natural language text

  • Application of Gene Ontology synonym rules: To explore the impact that our rules had on the recognition of concepts in the biomedical literature, we applied our synonym generation rules to two different versions of the Gene Ontology and compared the concepts identified before and after application on two different biomedical corpora

  • Manual validation of Gene Ontology mentions: Although we found an improvement in performance on the Colorado Richly Annotated Full-Text (CRAFT) corpus, and a significant number of additional concepts and mentions were identified on the larger corpus through our synonym generation rules, we are hesitant to reach any further conclusions without some manual validation of the accuracy of these generated synonyms


Summary

Introduction

Gene Ontology (GO) terms represent the standard for annotation and representation of molecular functions, biological processes and cellular compartments, but a large gap exists between the way concepts are represented in the ontology and how they are expressed in natural language text. Due to its importance in biology and the exponential growth of the biomedical literature over the past years, there has been much effort in utilizing GO for text mining tasks [1, 2]. Performance on these recognition tasks is lacking, as has been seen previously. The mining of GO concepts from large collections of biomedical literature has been shown to be useful for biomedical discovery, for example, pharmacogenomic gene prediction [7] and protein function prediction [8, 9]. Providing these discovery algorithms with cleaner and more abundant data could increase their accuracy of prediction and their generalizability.
