Enabling ad-hoc reuse of private data repositories through schema extraction

Lars Christoph Gleim, Lukas Zimmermann, Md Rezaul Karim, Oliver Kohlbacher, Oya Beyan, Stefan Decker, Holger Stenzhorn

doi:10.1186/s13326-020-00223-z

Abstract

Background: Sharing sensitive data across organizational boundaries is often significantly limited by legal and ethical restrictions. Regulations such as the EU General Data Protection Regulation (GDPR) impose strict requirements concerning the protection of personal and privacy-sensitive data. Therefore, new approaches, such as the Personal Health Train initiative, are emerging to utilize data right in their original repositories, circumventing the need to transfer data.

Results: Circumventing limitations of previous systems, this paper proposes a configurable and automated schema extraction and publishing approach, which enables ad-hoc SPARQL query formulation against RDF triple stores without requiring direct access to the private data. The approach is compatible with existing Semantic Web-based technologies and allows for the subsequent execution of such queries in a safe setting under the data provider's control. Evaluation with four distinct datasets shows that a configurable amount of concise and task-relevant schema, closely describing the structure of the underlying data, was derived, enabling the schema introspection-assisted authoring of SPARQL queries.

Conclusions: Automatically extracting and publishing data schemas can enable the introspection-assisted creation of data selection and integration queries. In conjunction with the presented system architecture, this approach can enable reuse of data from private repositories and in settings where agreeing upon a shared schema and encoding a priori is infeasible. As such, it could provide an important step towards reuse of data from previously inaccessible sources and thus towards the proliferation of data-driven methods in the biomedical domain.

Highlights

Sharing sensitive data across organizational boundaries is often significantly limited by legal and ethical restrictions
Methods: We propose an automated approach for extracting task-specific schema from Resource Description Framework (RDF) data sources in order to enable the efficient formulation of SPARQL Protocol and RDF Query Language (SPARQL) data selection and integration queries without direct access to the data
In the context of RDF data, the fundamental knowledge required for the creation of SPARQL queries for data selection and integration consists of the various rdf:type objects, the rdf:Property predicates and the structural relations between them
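
The three ingredients named in the last highlight (the rdf:type objects, the predicates, and the structural relations between them) can be illustrated with a small sketch. This is not the paper's extraction pipeline. It is a minimal illustration over in-memory triples represented as plain Python tuples, and the function name extract_schema, the "ex:" identifiers, and the use of None for untyped objects are all assumptions made for this example.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"  # prefixed name used as a plain string in this sketch


def extract_schema(triples):
    """Return (classes, predicates, relations) observed in the triples.

    Hypothetical helper for illustration; triples are (subject, predicate,
    object) string tuples rather than terms from a real RDF library.
    """
    # Map each subject to the set of classes it is declared an instance of.
    types = defaultdict(set)
    for s, p, o in triples:
        if p == RDF_TYPE:
            types[s].add(o)

    classes = {c for cs in types.values() for c in cs}
    predicates = set()
    relations = set()
    for s, p, o in triples:
        if p == RDF_TYPE:
            continue  # typing statements are captured in `classes` instead
        predicates.add(p)
        for domain in types.get(s, ()):
            # Objects without a declared type (e.g. literals) map to None.
            for rng in types.get(o) or {None}:
                relations.add((domain, p, rng))
    return classes, predicates, relations


# Made-up example data; no real dataset is implied.
triples = [
    ("ex:alice", "rdf:type", "ex:Patient"),
    ("ex:obs1", "rdf:type", "ex:Observation"),
    ("ex:alice", "ex:hasObservation", "ex:obs1"),
    ("ex:obs1", "ex:value", "42"),
]
classes, predicates, relations = extract_schema(triples)
```

Only class and predicate names and their connections are returned here, never instance identifiers or literal values, which is the kind of information a query author needs in order to write a SPARQL query without seeing the data itself.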

Summary

Introduction

Sharing sensitive data across organizational boundaries is often significantly limited by legal and ethical restrictions. Regulations such as the EU General Data Protection Regulation (GDPR) impose strict requirements concerning the protection of personal and privacy-sensitive data. In order to enable a data economy in privacy-sensitive domains and effective reuse of existing data and research, novel approaches are emerging to overcome these limitations. One of those approaches is the Personal Health Train (PHT) framework [14], which aims to bring algorithms and statistical models to data sources, rather than sharing data with third parties such as researchers. Unless there are universally agreed information models and data set descriptions, there is a need to create and communicate a schema (that is, information about the structure of the data) to enable writing queries for heterogeneous data resources.

Methods

Results

Discussion

Conclusion