Abstract

Large-scale datasets present unique opportunities to perform scientific investigations with unprecedented breadth. However, they also pose considerable challenges for the findability, accessibility, interoperability, and reusability (FAIR) of research outcomes due to infrastructure limitations, data usage constraints, or software license restrictions. Here we introduce a DataLad-based, domain-agnostic framework suitable for reproducible data processing in compliance with open science mandates. The framework attempts to minimize platform idiosyncrasies and performance-related complexities. It affords the capture of machine-actionable computational provenance records that can be used to retrace and verify the origins of research outcomes, as well as be re-executed independent of the original computing infrastructure. We demonstrate the framework’s performance using two showcases: one highlighting data sharing and transparency (using the studyforrest.org dataset) and another highlighting scalability (using the largest public brain imaging dataset available: the UK Biobank dataset).

Highlights

  • The storage and computational demands of today's largest datasets strain the capabilities of even well-endowed research institutions' high-performance computing (HPC) infrastructure, rendering their analysis unaffordable with methods common in fields accustomed to smaller datasets

  • We developed a custom extension, datalad-ukbiobank[41], to use the UK Biobank (UKB) as a data source for reproducible research

  • Distributed data transport and storage logistics offer flexibility to adapt to particular computing infrastructure
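As a sketch of how the datalad-ukbiobank extension is used in practice: the command names below follow the extension's documentation, but the participant ID, data-record field IDs, and keyfile path are placeholder assumptions, and exact options may differ across versions.

```shell
# Install DataLad together with the UKB extension (assumed via pip)
pip install datalad datalad-ukbiobank

# Create a DataLad dataset for one (hypothetical) participant and
# register the UK Biobank data records (field IDs) to be tracked
datalad create participant-1234567
cd participant-1234567
datalad ukb-init 1234567 20227_2_0 25755_2_0

# Fetch or refresh the data using an access keyfile issued by the
# UK Biobank, recording the download as a new dataset version
datalad ukb-update --keyfile ~/.ukb-keyfile --merge
```

Because each update is captured as a dataset version, later processing can reference an exact state of the input data, which is what makes the provenance records described in the abstract re-executable.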

Introduction

The amount of data available to researchers has steadily grown, and over the past decade a focus on diverse, representative samples has produced datasets of unprecedented size. The Wind Integration National Dataset (WIND) Toolkit[1], CERN open data (opendata.cern.ch), and NASA Earth data (earthdata.nasa.gov) are only some of the prominent examples of large, openly shared datasets across scientific disciplines. This development is accompanied by a growing awareness of the importance of making data findable, accessible, interoperable, and reusable (FAIR)[2], and by the increasing availability of research standards and tools that facilitate data sharing and management[3]. Data sharing minimizes duplicate efforts to perform resource-heavy, costly computations that have considerable environmental impact[5], and it can open up research on large datasets to scholars who lack access to adequate computational resources. In such contexts, data should be as FAIR as possible and handled in a sustainable manner that prioritizes sharing and reuse.
