Abstract
Background
The use of ontologies to control vocabulary and structure annotation has added value to genome-scale data, and contributed to the capture and re-use of knowledge across research domains. Gene Ontology (GO) is widely used to capture detailed expert knowledge in genomic-scale datasets and as a consequence has grown to contain many terms, making it unwieldy for many applications. To increase its ease of manipulation and efficiency of use, subsets called GO slims are often created by collapsing terms upward into more general, high-level terms relevant to a particular context. Creation of a GO slim currently requires manipulation and editing of GO by an expert (or community) familiar with both the ontology and the biological context. Decisions about which terms to include are necessarily subjective, and the creation process itself and subsequent curation are time-consuming and largely manual.
Results
Here we present an objective framework for generating customised ontology slims for specific annotated datasets, exploiting information latent in the structure of the ontology graph and in the annotation data. This framework combines ontology engineering approaches, and a data-driven algorithm that draws on graph and information theory. We illustrate this method by application to GO, generating GO slims at different information thresholds, characterising their depth of semantics and demonstrating the resulting gains in statistical power.
Conclusions
Our GO slim creation pipeline is available for use in conjunction with any GO-annotated dataset, and creates dataset-specific, objectively defined slims. This method is fast and scalable for application to other biomedical ontologies.
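To make the idea of an information threshold concrete, the sketch below computes a standard annotation-frequency information content, IC(t) = -log2 p(t), on a toy ontology and keeps the terms whose IC does not exceed a chosen threshold. It is only an illustration of the general idea, not the pipeline described in the abstract: the ontology, the annotations, the selection rule, and the names `PARENTS`, `ANNOTATIONS`, `slim` and `collapse` are all assumptions made for this example.

```python
import math

# Hypothetical toy ontology (not real GO): term -> list of parent terms.
PARENTS = {
    "root": [],
    "metabolism": ["root"],
    "transport": ["root"],
    "lipid metabolism": ["metabolism"],
    "ion transport": ["transport"],
}

# Hypothetical direct annotations: gene -> set of terms.
ANNOTATIONS = {
    "g1": {"lipid metabolism"},
    "g2": {"lipid metabolism"},
    "g3": {"metabolism"},
    "g4": {"ion transport"},
}


def ancestors(term):
    """Return the term together with all of its ancestors in the graph."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def propagate(terms):
    """Collapse a set of terms upward: the terms plus all their ancestors."""
    return set().union(*(ancestors(t) for t in terms))


def information_content():
    """IC(t) = -log2 of the fraction of genes annotated to t or a descendant."""
    counts = {t: 0 for t in PARENTS}
    for terms in ANNOTATIONS.values():
        for t in propagate(terms):
            counts[t] += 1
    total = len(ANNOTATIONS)
    return {t: -math.log2(c / total) for t, c in counts.items() if c > 0}


def slim(threshold):
    """Assumed selection rule: keep terms whose IC is at most the threshold."""
    return {t for t, ic in information_content().items() if ic <= threshold}


def collapse(terms, slim_terms):
    """Map a gene's annotations onto the slim terms that subsume them."""
    return propagate(terms) & slim_terms
```

With these toy data, `slim(0.5)` keeps only `root` and `metabolism`, while `slim(1.0)` also keeps `lipid metabolism`; raising the threshold admits more specific terms. A gene annotated to `ion transport` collapses to `root` under `slim(1.0)`, because none of its more specific ancestors falls below the threshold.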
Highlights
The use of ontologies to control vocabulary and structure annotation has added value to genome-scale data, and contributed to the capture and re-use of knowledge across research domains
In this paper we introduce a general approach, based on ontology management principles, graph theory and information theory, for the automated generation of ontology slims based on information obtained from both annotations and the ontology structure, and we illustrate the application of this method to the generation of high-quality Gene Ontology (GO) slims at a series of information content thresholds
GO slim for yeast: Here we analyse a set of GO slims generated across a range of information content thresholds on the yeast GO annotation contained in the Saccharomyces Genome Database (SGD) [27], and compare them with the manually created yeast GO slim maintained by the yeast community
Summary
The use of ontologies to control vocabulary and structure annotation has added value to genome-scale data, and contributed to the capture and re-use of knowledge across research domains. The Gene Ontology Consortium, which is responsible for the ongoing development of GO, draws its members from a number of organism-specific databases including FlyBase [2], Mouse Genome Database [3], WormBase [4], the Arabidopsis Information Resource [5], and the Zebrafish Information Network [6]. These consortium members, and others such as the Gene Ontology Annotation Database [7,8], produce GO annotations for public use. This community structure has contributed to the broad acceptance and adoption of GO as the primary controlled vocabulary for molecular genetics and genomics.