Abstract

While the biomedical community has published several “open data” sources in the last decade, most researchers still endure severe logistical and technical challenges to discover, query, and integrate heterogeneous data and knowledge from multiple sources. To tackle these challenges, the community has experimented with Semantic Web and linked data technologies to create the Life Sciences Linked Open Data (LSLOD) cloud. In this paper, we extract schemas from more than 80 biomedical linked open data sources into an LSLOD schema graph and conduct an empirical meta-analysis to evaluate the extent of semantic heterogeneity across the LSLOD cloud. We observe that several LSLOD sources exist as stand-alone data sources that are not inter-linked with other sources, use unpublished schemas with minimal reuse or mappings, and have elements that are not useful for data integration from a biomedical perspective. We envision that the LSLOD schema graph and the findings from this research will aid researchers who wish to query and integrate data and knowledge from multiple biomedical sources simultaneously on the Web.

Highlights

  • In subsequent sections of this paper, we provide a brief overview of Semantic Web technologies and the Life Sciences Linked Open Data (LSLOD) cloud, describe the different biomedical data and knowledge sources used in this research, and outline the methods used to extract schemas and vocabularies from publicly available Linked Open Data (LOD) graphs and to evaluate the reuse and similarity of content across the LSLOD cloud

  • We established the following criteria for an LSLOD source to be included in the meta-analysis: (i) each LSLOD source must have a functional SPARQL endpoint, or, (ii) where it does not, the source should be available as Resource Description Framework (RDF) data dumps that can be downloaded and stored in a local SPARQL repository, and (iii) each LSLOD source must have at least 1,000 instances under some classification scheme that can be queried through the SPARQL endpoint

  • We found that RDF graphs may exhibit semantic mismatch, where instances are aligned via the rdf:type property to classes from an exhaustive ontology, such as ChEBI[6] or NCIT[15], that has more than 50,000 classes
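Criterion (iii) above can be checked with a query of the following shape, which counts the instances asserted under each class at an endpoint. This is a minimal sketch, not necessarily the exact query used by the authors:

```sparql
# Count instances per class at a SPARQL endpoint; a source qualifies for
# the meta-analysis only if some class has at least 1,000 instances.
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT ?class (COUNT(?instance) AS ?count)
WHERE {
  ?instance rdf:type ?class .
}
GROUP BY ?class
ORDER BY DESC(?count)
```

Running such a query against a very large endpoint may time out, so in practice the pattern is often restricted to a sample or issued per class.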


Introduction

The biomedical research community has published and made available, on the Web, several sources of biomedical data and knowledge: anonymized medical records[1], imaging data[2], sequencing data[3], biomedical publications[4], biological assays[5], chemical compounds and their activities[6], biological molecules and their characteristics[7], knowledge encoded in biological pathways[8,9], animal models[10], drugs and their protein targets[11], and medical knowledge on organs, symptoms, diseases, and adverse reactions[12]. To achieve the goals of a Linked Open Data (LOD) cloud over the Web, the Semantic Web community has developed several standards, languages, and technologies that provide a common framework for representing, linking, sharing, and querying data and knowledge across application, enterprise, and community boundaries. These languages and technologies have been used to represent and link data and knowledge sources from fields as diverse as the life sciences, geography, economics, media, and statistics, creating a linked network of these sources and a scalable infrastructure for structured querying of multiple heterogeneous sources simultaneously, for Web-scale computation, and for seamless integration of big data.

The Resource Description Framework (RDF), a simple, standard triple-based model for data interchange on the Web, is used to represent information resources (e.g., biomedical entities, relations, classes) as linked graphs on the LSLOD cloud[35]. RDF itself is only a triple-based, schema-less modeling language; the semantics expressed in RDFS vocabularies and OWL ontologies can be exploited by computer programs, called reasoners, to verify the consistency of assertions in an RDF graph and to generate novel inferences[13].
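As a sketch of what structured querying across multiple sources can look like, the hypothetical federated query below uses the SPARQL 1.1 SERVICE keyword to join drug-target assertions from one endpoint with protein classifications from another. All endpoint URLs, predicates, and classes here are illustrative placeholders, not actual LSLOD vocabulary:

```sparql
# Hypothetical federated query over two SPARQL endpoints.
# Endpoint URLs and the hasTarget predicate are placeholders.
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex:   <http://example.org/vocab/>

SELECT ?drugLabel ?target
WHERE {
  # Drugs and their protein targets from a drug-centric source
  SERVICE <https://example.org/drugbank/sparql> {
    ?drug rdfs:label ?drugLabel ;
          ex:hasTarget ?target .
  }
  # Keep only targets typed as proteins in a protein-centric source
  SERVICE <https://example.org/uniprot/sparql> {
    ?target a ex:Protein .
  }
}
```

Such federation works only when both sources identify the shared entity (?target) with the same URI or with explicitly mapped URIs, which is precisely the inter-linking that the meta-analysis finds lacking across much of the LSLOD cloud.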

