Abstract

Reproducible research necessitates full transparency and integrity in the collection (e.g. from observations) or generation of data, and in the subsequent processing and analysis that yield research products. However, Earth and environmental science data are growing in complexity, volume and variety, and today, particularly for large-volume Earth observation and geophysics datasets, achieving this transparency is not easy. It is rare for a published data product to be created in a single processing event by a single author or research group. Modern research data processing pipelines/workflows can have complex lineages: an individual research product is more likely to be generated through multiple levels of processing, starting from raw instrument data at full resolution (L0) followed by successive levels of processing (L1-L4) that progressively convert raw instrument data into more useful parameters and formats. Each level of processing can be undertaken by different research groups using a variety of funding sources, and rarely are those involved in the early stages of processing/funding properly cited. At the lower levels of processing, observational data remain at full resolution and are calibrated, georeferenced and processed to sensor units (L1), after which geophysical variables are derived (L2). Historically, particularly where the volumes of the L0-L2 datasets are measured in Terabytes to Petabytes, processing could only be undertaken by a minority of specialised scientific research groups and data providers, as few had the expertise, resources and infrastructure to process them on-premise.
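The multi-level lineage described above can be sketched in a few lines of code. This is a minimal illustration only: the product names, producing groups and dataclass structure are all hypothetical, not part of any standard processing framework.

```python
from dataclasses import dataclass, field

@dataclass
class Product:
    """A data product at one processing level, with links to its inputs."""
    level: str            # "L0" .. "L4"
    description: str
    producer: str         # research group responsible for this step
    inputs: list = field(default_factory=list)  # upstream Products

def lineage(product, depth=0):
    """Walk the processing chain back to the raw L0 data."""
    chain = [(depth, product.level, product.description, product.producer)]
    for parent in product.inputs:
        chain.extend(lineage(parent, depth + 1))
    return chain

# Hypothetical Earth-observation lineage (names illustrative only).
l0 = Product("L0", "raw instrument telemetry, full resolution", "Agency A")
l1 = Product("L1", "calibrated, georeferenced sensor units", "Group B", [l0])
l2 = Product("L2", "derived geophysical variables", "Group C", [l1])
l3 = Product("L3", "gridded, resampled product", "Group D", [l2])

for depth, level, desc, who in lineage(l3):
    print(f"{'  ' * depth}{level}: {desc} ({who})")
```

Even this toy model makes the citation problem visible: the L3 product carries four distinct producers in its chain, and a citation of the L3 product alone credits only one of them.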
Wider availability of co-located data assets and HPC/cloud processing means that the full-resolution, less processed forms of observational data can now be processed remotely, in realistic timeframes, by multiple researchers to their specific processing requirements; it also enables greater exploration of parameter space, allowing multiple values for the same inputs to be trialled. The advantage is that better-targeted research products can now be produced rapidly. The downside is that far greater care must be taken to ensure there is sufficient machine-readable metadata and provenance information to enable any user to determine which processing steps and input parameters were used in each part of the lineage of any released dataset/data product, to reference exactly who undertook each part of the acquisition/processing, and to identify sources of funding (including the instruments/field campaigns that collected the data). The use of Persistent Identifiers (PIDs) for all component objects (observational data, synthetic data, software, model inputs, people, instruments, grants, organisations, etc.) will be critical. Global and interdisciplinary research teams of the future will rely on software engineers to develop community-driven software environments that aid and enhance the transparency and reproducibility of their scientific workflows and ensure recognition. The advantage of the PID approach is that not only will reproducibility and transparency be enhanced, but through the use of Knowledge Graphs it will also be possible to trace the input of any researcher at any level of processing, while funders will be able to determine the impact of each stage, from the raw data capture through to any derivative high-level data product.
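A PID-linked Knowledge Graph of the kind envisaged here can be sketched as a small set of subject-predicate-object triples. The predicates `wasDerivedFrom` and `wasGeneratedBy` are borrowed from the W3C PROV vocabulary; `collectedBy` and `fundedBy`, and all identifiers below, are hypothetical placeholders rather than resolvable PIDs.

```python
# Minimal sketch of a PID-linked provenance graph. All identifiers are
# hypothetical; a real system would use resolvable DOIs, ORCID iDs,
# ROR IDs, PIDINST instrument PIDs and grant PIDs.
triples = [
    ("doi:10.5555/l2-product", "wasDerivedFrom", "doi:10.5555/l1-product"),
    ("doi:10.5555/l1-product", "wasDerivedFrom", "doi:10.5555/l0-raw"),
    ("doi:10.5555/l1-product", "wasGeneratedBy", "orcid:0000-0001-2345-6789"),
    ("doi:10.5555/l0-raw",     "collectedBy",    "pidinst:instrument-42"),
    ("doi:10.5555/l1-product", "fundedBy",       "grant:ABC-123"),
]

def ancestors(pid):
    """Follow 'wasDerivedFrom' links back through the full lineage."""
    chain = []
    for s, p, o in triples:
        if s == pid and p == "wasDerivedFrom":
            chain.append(o)
            chain.extend(ancestors(o))
    return chain

def contributors(pid):
    """Every person, instrument and grant attached anywhere in the lineage."""
    roles = {"wasGeneratedBy", "collectedBy", "fundedBy"}
    scope = [pid] + ancestors(pid)
    return {o for s, p, o in triples if s in scope and p in roles}

print(ancestors("doi:10.5555/l2-product"))
print(contributors("doi:10.5555/l2-product"))
```

Queried this way, the graph answers exactly the questions raised above: a user of the L2 product can recover its full lineage, and a funder can find the grant and instrument behind an upstream processing level that the L2 product's own landing page never mentions.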
