Abstract

The ability to measure the use and impact of published data sets is key to the success of the open data/open science paradigm. A direct measure of impact would require tracking data (re)use in the wild, which is difficult to achieve. This is therefore commonly replaced by simpler metrics based on data download and citation counts. In this paper we describe a scenario where it is possible to track the trajectory of a dataset after its publication, and show how this enables the design of accurate models for ascribing credit to data originators. A Data Trajectory (DT) is a graph that encodes knowledge of how, by whom, and in which context data has been re-used, possibly after several generations. We provide a theoretical model of DTs that is grounded in the W3C PROV data model for provenance, and we show how DTs can be used to automatically propagate a fraction of the credit associated with transitively derived datasets, back to original data contributors. We also show this model of transitive credit in action by means of a Data Reuse Simulator. In the longer term, our ultimate hope is that credit models based on direct measures of data reuse will provide further incentives to data publication. We conclude by outlining a research agenda to address the hard questions of creating, collecting, and using DTs systematically across a large number of data reuse instances in the wild.

Highlights

  • The practice of publishing Research Data has been maturing rapidly, following increasing evidence that the combination of data sharing and emerging data citation practices represents new opportunities for extending the value chain of the data, rather than a threat

Draft from 16th January 2016. The International Digital Curation Conference takes place on [TBC] in [TBC].

  • The main hypothesis that motivates our research is that knowledge of Data Trajectories (DTs) makes it possible to quantify the impact and influence of Research Data transitively, through several generations of reuse and derivation

  • With the understanding that many possible such models can be defined, we have implemented a data reuse simulator, which we use as a research tool for exploring different credit models, and for understanding their implications for data publishers. These contributions are designed to lay the foundations for further research in the area of data reuse analysis based on provenance


Summary

Introduction

The practice of publishing Research Data has been maturing rapidly, following increasing evidence that the combination of data sharing and emerging data citation practices represents new opportunities for extending the value chain of the data, rather than a threat. Less is known, however, about the lifetime of those datasets after their publication, namely the knowledge of how, by whom, and in which context they have been re-used, and whether such instances of re-use have produced interesting derived data products, possibly after several generations. We refer to this new type of knowledge as the trajectories of published data (Data Trajectories, or DTs). The main hypothesis that motivates our research is that knowledge of DTs makes it possible to quantify the impact and influence of Research Data transitively, through several generations of reuse and derivation. This will lead to new notions of transitive credit to data owners, which may inform and extend current data citation practices. Earlier work in this direction includes (Katz, 2014), where the concept is neither fully formalised nor made operational through metadata management and analysis.
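To make the idea of transitive credit concrete, the following is a minimal sketch and not the credit model defined in the paper. It assumes a trajectory reduced to an acyclic map from each dataset to the datasets it was derived from (in the spirit of the PROV `wasDerivedFrom` relation), and a single hypothetical parameter `fraction` that fixes the share of credit each dataset passes back, split equally among its sources. The function name, the parameter, and the equal split are all illustrative assumptions.

```python
from collections import defaultdict


def propagate_credit(derived_from, direct_credit, fraction=0.5):
    """Return the total credit per dataset after transitive propagation.

    derived_from:  dict mapping a dataset to the list of datasets it was
                   derived from (assumed acyclic).
    direct_credit: dict mapping a dataset to the credit it earned directly.
    fraction:      hypothetical share of credit passed back to the sources.
    """
    total = defaultdict(float)

    def push(dataset, amount):
        sources = derived_from.get(dataset, [])
        if not sources:
            # An original dataset keeps everything that reaches it.
            total[dataset] += amount
            return
        passed_on = amount * fraction
        total[dataset] += amount - passed_on
        # Assumption: the passed-on share is split equally among sources.
        for source in sources:
            push(source, passed_on / len(sources))

    for dataset, amount in direct_credit.items():
        push(dataset, amount)
    return dict(total)


if __name__ == "__main__":
    # C was derived from B, which was derived from A; only C earned credit.
    trajectory = {"B": ["A"], "C": ["B"]}
    print(propagate_credit(trajectory, {"C": 8.0}, fraction=0.5))
```

In this sketch credit is conserved: the totals always sum to the credit earned directly, so the choice of `fraction` only changes how that credit is distributed between derived datasets and their originators.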

Challenges in tracking data reuse and the role of data citation
Provlets and Data Trajectories
Research Objects
The PROV model for provenance
Data Trajectories
From data trajectories to transitive credit for data owners
RO derivation with unknown activity
Model summary
Simulating Data Trajectories and credit propagation