Abstract

Precisely identifying arbitrary subsets of data so that these can be reproduced is a daunting challenge in data-driven science, the more so if the underlying data source is dynamically evolving. Yet an increasing number of settings exhibit exactly those characteristics. Larger amounts of data are being continuously ingested from a range of sources (be it sensor values, online questionnaires, documents, etc.), with error correction and quality improvement processes adding to the dynamics. Yet, for studies to be reproducible, for decision-making to be transparent, and for meta studies to be performed conveniently, having a precise identification mechanism to reference, retrieve, and work with such data is essential. The Research Data Alliance (RDA) Working Group on Dynamic Data Citation has published 14 recommendations that are centered around time-stamping and versioning evolving data sources and identifying subsets dynamically via persistent identifiers that are assigned to the queries selecting the respective subsets. These principles are generic and work for virtually any kind of data. In the past few years numerous repositories around the globe have implemented these recommendations and deployed solutions. We provide an overview of the recommendations, reference implementations, and pilot systems deployed and then analyze lessons learned from these implementations. This article provides a basis for institutions and data stewards considering adding this functionality to their data systems.

Highlights

  • Accountability and transparency in automated decisions (ACM US Public Policy Council, 2017) have important implications for the way we perform studies, analyze data, and prepare the basis for data-driven decision making

  • To identify reproducible subsets for data citation, sharing, and reuse, 14 recommendations were formulated by the Working Group on Data Citation (WGDC) of the Research Data Alliance (RDA)

  • While there is broad acceptance in the information science community, as evidenced through the Joint Declaration of Data Citation Principles (Data Citation Synthesis Group, 2014), the actual practice is still evolving, especially for citing dynamic data (Parsons, Duerr, & Jones, 2019). Multiple implementations, both conceptual and in practice, especially those briefly presented in this article, suggest that the RDA Recommendations present a valid, viable, and adaptable approach that may be emerging as a community standard

Summary

Introduction

Accountability and transparency in automated decisions (ACM US Public Policy Council, 2017) have important implications for the way we perform studies, analyze data, and prepare the basis for data-driven decision making. By assigning a persistent identifier (PID, e.g., a Digital Object Identifier, DOI) to the queries that select subsets of the data, these queries become resolvable and can be reexecuted transparently against the time-stamped database to re-create the exact subset that was initially selected. This eliminates the need for predefined subsets that are frozen at fixed intervals, avoids data duplication, and is transparent to the researcher, while at the same time being applicable to virtually all types of data, such as databases, spreadsheets, collections of files, or an individual image.
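
The query-based mechanism described above can be illustrated with a minimal sketch. It is not taken from any of the implementations discussed in this article; it assumes a small SQLite table of sensor readings, integer logical timestamps, and a locally generated identifier string standing in for a real PID such as a DOI. The table names, the function names (`write`, `cite`, `resolve`), and the result hash used as a sanity check are all hypothetical choices made for illustration.

```python
import hashlib
import json
import sqlite3

# The subset query is evaluated "as of" a timestamp: a row version counts
# only if it was already valid at :ts and had not yet been superseded.
SUBSET_SQL = (
    "SELECT sensor, value FROM readings "
    "WHERE value >= :min_value "
    "AND valid_from <= :ts AND (valid_to IS NULL OR valid_to > :ts) "
    "ORDER BY sensor"  # fixed ordering so re-execution is comparable
)


def open_store():
    """Create an in-memory database with a versioned table and a query store."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE readings (
            sensor     TEXT    NOT NULL,
            value      REAL    NOT NULL,
            valid_from INTEGER NOT NULL,
            valid_to   INTEGER            -- NULL while this version is current
        );
        CREATE TABLE query_store (
            pid         TEXT PRIMARY KEY,
            sql         TEXT NOT NULL,
            params      TEXT NOT NULL,
            result_hash TEXT NOT NULL
        );
    """)
    return conn


def write(conn, sensor, value, ts):
    """Insert or correct a reading; the old version is closed, never deleted."""
    conn.execute(
        "UPDATE readings SET valid_to = ? WHERE sensor = ? AND valid_to IS NULL",
        (ts, sensor),
    )
    conn.execute("INSERT INTO readings VALUES (?, ?, ?, NULL)", (sensor, value, ts))


def _run(conn, sql, params):
    rows = conn.execute(sql, params).fetchall()
    digest = hashlib.sha256(json.dumps(rows).encode()).hexdigest()
    return rows, digest


def cite(conn, min_value, ts):
    """Execute the subset query at time ts and store it under a stand-in PID."""
    params = {"min_value": min_value, "ts": ts}
    rows, digest = _run(conn, SUBSET_SQL, params)
    params_json = json.dumps(params, sort_keys=True)
    # Hypothetical local identifier; a real system would mint a PID via a registry.
    pid = "pid:" + hashlib.sha256((SUBSET_SQL + params_json).encode()).hexdigest()[:12]
    conn.execute(
        "INSERT OR IGNORE INTO query_store VALUES (?, ?, ?, ?)",
        (pid, SUBSET_SQL, params_json, digest),
    )
    return pid, rows


def resolve(conn, pid):
    """Re-execute the stored query and check it yields the cited subset."""
    sql, params_json, stored = conn.execute(
        "SELECT sql, params, result_hash FROM query_store WHERE pid = ?", (pid,)
    ).fetchone()
    rows, digest = _run(conn, sql, json.loads(params_json))
    if digest != stored:
        raise ValueError("re-executed subset differs from the cited one")
    return rows


conn = open_store()
write(conn, "s1", 21.5, ts=1)
write(conn, "s2", 19.0, ts=1)
pid, cited = cite(conn, min_value=20.0, ts=2)   # subset as it looked at ts=2
write(conn, "s2", 23.0, ts=3)                   # later correction to s2
restored = resolve(conn, pid)                   # same subset as originally cited
```

In this sketch the correction to `s2` changes what the same query returns for the current state of the data, yet resolving the stored identifier still yields the originally cited subset, because only the query and its timestamp are stored, not a copy of the data.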

RDA Recommendations on Dynamic Data Citation
Proof-of-Concept Implementations
Relational Databases
File-based Data via Git
XML Databases
NoSQL-based Data Citation Support Added to CKAN
Pilot Adopters and Deployments
Deep Carbon Observatory
Ocean Network Canada
Discussion and Lessons
Conclusions