Abstract

BackgroundThe ability to query many independent biological databases using a common ontology-based semantic model would facilitate deeper integration and more effective utilization of these diverse and rapidly growing resources. Despite ongoing work moving toward shared data formats and linked identifiers, significant problems persist in semantic data integration in order to establish shared identity and shared meaning across heterogeneous biomedical data sources.ResultsWe present five processes for semantic data integration that, when applied collectively, solve seven key problems. These processes include making explicit the differences between biomedical concepts and database records, aggregating sets of identifiers denoting the same biomedical concepts across data sources, and using declaratively represented forward-chaining rules to take information that is variably represented in source databases and integrating it into a consistent biomedical representation. We demonstrate these processes and solutions by presenting KaBOB (the Knowledge Base Of Biomedicine), a knowledge base of semantically integrated data from 18 prominent biomedical databases using common representations grounded in Open Biomedical Ontologies. An instance of KaBOB with data about humans and seven major model organisms can be built using on the order of 500 million RDF triples. All source code for building KaBOB is available under an open-source license.ConclusionsKaBOB is an integrated knowledge base of biomedical data representationally based in prominent, actively maintained Open Biomedical Ontologies, thus enabling queries of the underlying data in terms of biomedical concepts (e.g., genes and gene products, interactions and processes) rather than features of source-specific data schemas or file formats. KaBOB resolves many of the issues that routinely plague biomedical researchers intending to work with data from multiple data sources and provides a platform for ongoing data integration and development and for formal reasoning over a wealth of integrated biomedical data.Electronic supplementary materialThe online version of this article (doi:10.1186/s12859-015-0559-3) contains supplementary material, which is available to authorized users.

Highlights

  • The ability to query many independent biological databases using a common ontology-based semantic model would facilitate deeper integration and more effective utilization of these diverse and rapidly growing resources

  • We have put together a set of five methods, some novel and some that build on prior work of others, that overcome problems commonly encountered when attempting to semantically integrate information from multiple data sources, When applied collectively these methods resolve seven key problems. (1) varying identifiers used across data sources to refer to the same concepts; (2) differing file formats using different lexicalizations and interpretations of identifiers; (3) conflation of informational entities with biomedical concepts; (4) the use of varying non-ontologically grounded semantic models; (5) errors and inconsistency among source data; (6) instability of identifiers and URIs over time in integrated resources; and (7) difficulty in tracing and reporting provenance of integrated data. We demonstrate these solutions by presenting KaBOB, a system that integrates 18 sources of biomedical data using 14 prominent Open Biomedical Ontologies (OBOs) as a foundation and vocabulary for modeling, facilitating interaction with the wealth of existing data and tools that already rely on these OBOs

  • One fully functional version has been built using only human source data and another using human data plus data for seven major eukaryotic model organisms: Mus musculus (10090), Rattus norvegicus (10116), Drosophila melanogaster (7227), Saccharomyces cerevisiae (4932), Caenorhabditis elegans (6239), Danio rerio (7955), and Arabidopsis thaliana (3702). (Data for subtaxa of these taxa represented in the NCBI Taxonomy, e.g., subspecies of mice and strains of yeast, have been included.) Ideally, a version of KaBOB would be built using data for every organism included in the data sources; that is out of reach of our current hardware/software configuration

Read more

Summary

Introduction

The ability to query many independent biological databases using a common ontology-based semantic model would facilitate deeper integration and more effective utilization of these diverse and rapidly growing resources. Much contemporary biomedical research depends on broad and unbiased assays at genomic scale. The aggregate consequences of this failure to capitalize on existing knowledge for the interpretation of genome-scale experimental results is a substantial reduction in the efficiency of the biomedical research enterprise writ large, delaying the development of both key insights and new therapies. The ability to query many independent biological databases using a common, community-driven semantic model would facilitate deeper integration and more effective utilization of these diverse and rapidly growing resources

Methods
Results
Discussion
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call