Abstract

Gene Ontology is used extensively in scientific knowledgebases and repositories to organize a wealth of biological information. However, interpreting annotations derived from differential gene lists is often difficult without manually sorting them into higher-order categories. To address this issue, we present GOcats, a novel tool that organizes the Gene Ontology (GO) into subgraphs representing user-defined concepts, while ensuring that all appropriate relations are congruent with respect to scoping semantics. We tested GOcats' performance using subcellular location categories to mine annotations from GO-utilizing knowledgebases and evaluated their accuracy against immunohistochemistry datasets in the Human Protein Atlas (HPA). In comparison to term categorizations generated from UniProt's controlled vocabulary and from GO slims via OWLTools' Map2Slim, GOcats outperformed these methods in its ability to mimic human-categorized GO term sets. Unlike the other methods, GOcats relies only on an input of basic keywords from the user (e.g., a biologist), not a manually compiled or static set of top-level GO terms. Additionally, by identifying and properly defining relations with respect to semantic scope, GOcats can utilize the traditionally problematic relation, has_part, without encountering erroneous term mapping. We applied GOcats in the comparison of HPA-sourced knowledgebase annotations to experimentally derived annotations provided by HPA directly. During the comparison, GOcats improved correspondence between the annotation sources by adjusting semantic granularity. GOcats enables the creation of custom, GO slim-like filters to map fine-grained gene annotations from gene annotation files to general subcellular compartments without needing to hand-select a set of GO terms for categorization. Moreover, GOcats can customize the level of semantic specificity for annotation categories. Furthermore, GOcats enables safe and more comprehensive use of go-core with respect to semantic scoping, allowing more complete use of the information available in GO. Together, these improvements can impact a variety of GO knowledgebase data mining use-cases as well as knowledgebase curation and quality control.

Highlights

  • The Gene Ontology (GO) [1] is the most common biology-focused controlled vocabulary (CV) used to represent information and knowledge distilled from most biological and biomedical research data generated today, from classic wet-bench experiments to high-throughput analytical platforms, especially omics technologies

  • GO Categorization Suite (GOcats) keywords represented within the GO Cellular Component sub-ontology

  • While keyword querying of GO provided an initial seeding of the growing subgraph, Table 1 highlights the necessity of re-analyzing the GO graph, both to remove terms erroneously added by the keyword search and to add appropriate subgraph terms not captured by the keyword search


Summary

Introduction

The Gene Ontology (GO) [1] is the most common biology-focused controlled vocabulary (CV) used to represent information and knowledge distilled from most biological and biomedical research data generated today, from classic wet-bench experiments to high-throughput analytical platforms, especially omics technologies. GO is divided into three sub-ontologies: Cellular Component, Molecular Function, and Biological Process. A graph represents each sub-ontology, where individual GO terms are nodes connected by directional edges (i.e., relations). For example, the term “lobed nucleus” (GO:0098537) is connected by a directional is_a relation edge to the term “nucleus” (GO:0005634). In this graph context, the is_a relation defines the term “nucleus” as a parent of the term “lobed nucleus”. The three GO sub-ontologies are “is_a disjoint”, meaning that there are no is_a relations connecting any node among the three sub-ontologies.
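The graph structure described above can be illustrated with a minimal Python sketch. It is not GOcats code: the edge-list layout and the helper names `build_parent_index` and `ancestors` are assumptions made for illustration, and the only edge included is the “lobed nucleus” is_a “nucleus” example from the text.

```python
from collections import defaultdict

# Directed edges as (child, relation, parent). Only this one edge is taken
# from the text; a real GO graph holds many more terms and relation types.
EDGES = [
    ("GO:0098537", "is_a", "GO:0005634"),  # lobed nucleus is_a nucleus
]
NAMES = {"GO:0098537": "lobed nucleus", "GO:0005634": "nucleus"}


def build_parent_index(edges):
    """Map each term to the set of its direct is_a parents (hypothetical helper)."""
    parents = defaultdict(set)
    for child, relation, parent in edges:
        if relation == "is_a":
            parents[child].add(parent)
    return parents


def ancestors(term, parents):
    """Collect every term reachable from `term` by following is_a edges upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


parents = build_parent_index(EDGES)
print(sorted(NAMES[t] for t in ancestors("GO:0098537", parents)))
```

Because the edges are directional, walking upward from “lobed nucleus” reaches “nucleus”, while walking upward from “nucleus” reaches nothing in this toy graph.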

