Generation of open biomedical datasets through ontology-driven transformation and integration processes.

María Del Carmen Legaz-García, Jesualdo Tomás Fernández-Breis, Marcos Menárguez-Tortosa, José Antonio Miñarro-Giménez

doi:10.1186/s13326-016-0075-z

Open Access

Abstract

Background: Biomedical research usually requires combining large volumes of data from multiple heterogeneous sources, which makes the integrated exploitation of such data difficult. The Semantic Web paradigm offers a natural technological space for data integration and exploitation by generating machine-readable content. Linked Open Data is a Semantic Web initiative that promotes the publication and sharing of data in machine-readable semantic formats.

Methods: We present an approach for the transformation and integration of heterogeneous biomedical data with the objective of generating open biomedical datasets in Semantic Web formats. The transformation of the data is based on mappings between the entities of the data schema and the ontological infrastructure that provides the meaning of the content. Our approach permits different types of mappings and includes the possibility of defining complex transformation patterns. Once the mappings are defined, they can be applied automatically to datasets to generate logically consistent content, and they can be reused in further transformation processes.

Results: The results of our research are (1) a common transformation and integration process for heterogeneous biomedical data; (2) the application of Linked Open Data principles to generate interoperable, open biomedical datasets; and (3) a software tool, called SWIT, that implements the approach. In this paper we also describe how we have applied SWIT in different biomedical scenarios and some lessons learned.

Conclusions: We have presented an approach that is able to generate open biomedical repositories in Semantic Web formats. SWIT is able to apply the Linked Open Data principles in the generation of the datasets, thus allowing their content to be linked to external repositories and creating linked open datasets. SWIT datasets may contain data from multiple sources and schemas, thus becoming integrated datasets.
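To make the mapping-based transformation described in the abstract more concrete, the following is a minimal sketch in Python. It is not SWIT's implementation: the namespaces, the `PROTEIN_MAPPING` structure, and the `apply_mapping` function are all hypothetical names chosen for illustration. It assumes a simple case in which one source entity maps to one ontology class and each column maps to one datatype property, and it emits N-Triples lines.

```python
from urllib.parse import quote

# Hypothetical namespaces, used only for illustration.
BASE = "http://example.org/resource/"
ONTO = "http://example.org/ontology#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical declarative mapping: a source entity (table) is mapped to an
# ontology class, and each of its columns to a datatype property.
PROTEIN_MAPPING = {
    "class": ONTO + "Protein",
    "id_column": "accession",
    "properties": {
        "name": ONTO + "hasName",
        "organism": ONTO + "hasOrganism",
    },
}


def escape_literal(value):
    """Escape backslashes, quotes, and newlines for an N-Triples literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def apply_mapping(rows, mapping):
    """Apply one mapping to a list of row dicts; return sorted N-Triples lines.

    Triples are collected in a set, so rows from different sources that share
    an identifier contribute to the same subject without duplicates.
    """
    triples = set()
    for row in rows:
        subject = f"<{BASE}{quote(str(row[mapping['id_column']]))}>"
        triples.add(f"{subject} <{RDF_TYPE}> <{mapping['class']}> .")
        for column, prop in mapping["properties"].items():
            value = row.get(column)
            if value is None or value == "":
                continue  # skip columns that this source does not provide
            triples.add(f'{subject} <{prop}> "{escape_literal(str(value))}" .')
    return sorted(triples)


if __name__ == "__main__":
    # Two hypothetical sources describing the same protein.
    source_a = [{"accession": "P12345", "name": "Example protein"}]
    source_b = [{"accession": "P12345", "organism": "Homo sapiens"}]
    for line in apply_mapping(source_a + source_b, PROTEIN_MAPPING):
        print(line)
```

Because the mapping is plain data, the same definition can be reapplied to any source that follows the same schema, which mirrors the reuse of mappings mentioned in the abstract. The sketch does not cover complex transformation patterns or consistency checking against the ontology.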

Highlights

Biomedicine is a knowledge-based discipline, in which the production of knowledge from data is a daily activity
We describe the methods included in our approach for the generation of open biomedical datasets
To the best of our knowledge, current Ontology-Based Data Access (OBDA) approaches do not facilitate the application of ontology patterns as we do in this work, which permits a semantically-richer representation and exploitation of data

Summary

Introduction

Biomedicine is a knowledge-based discipline, in which the production of knowledge from data is a daily activity. Biomedical research usually requires combining large volumes of data from multiple heterogeneous sources, which makes the integrated exploitation of such data difficult. The term biomedical data covers a wide range of types of data used in biomedicine. Such data are usually stored and represented in different, heterogeneous formats, which makes their joint exploitation difficult. The information about a concrete biomedical entity, such as a protein, is distributed across many different databases, which makes it necessary to combine information from different sources to get all the information. These heterogeneous resources do not even share identifiers for the biological entities, although this particular aspect is being addressed by initiatives like identifiers.org [25]. Biomedical resources such as the Gene Ontology [15] or CHEBI [27] provide their data in relational format.
