Abstract

BackgroundThe automation of data analysis in the form of scientific workflows has become a widely adopted practice in many fields of research. Computationally driven data-intensive experiments using workflows enable automation, scaling, adaptation, and provenance support. However, there are still several challenges associated with the effective sharing, publication, and reproducibility of such workflows due to the incomplete capture of provenance and lack of interoperability between different technical (software) platforms.ResultsBased on best-practice recommendations identified from the literature on workflow design, sharing, and publishing, we define a hierarchical provenance framework to achieve uniformity in provenance and support comprehensive and fully re-executable workflows equipped with domain-specific information. To realize this framework, we present CWLProv, a standard-based format to represent any workflow-based computational analysis to produce workflow output artefacts that satisfy the various levels of provenance. We use open source community-driven standards, interoperable workflow definitions in Common Workflow Language (CWL), structured provenance representation using the W3C PROV model, and resource aggregation and sharing as workflow-centric research objects generated along with the final outputs of a given workflow enactment. We demonstrate the utility of this approach through a practical implementation of CWLProv and evaluation using real-life genomic workflows developed by independent groups.ConclusionsThe underlying principles of the standards utilized by CWLProv enable semantically rich and executable research objects that capture computational workflows with retrospective provenance such that any platform supporting CWL will be able to understand the analysis, reuse the methods for partial reruns, or reproduce the analysis to validate the published findings.

Highlights

  • Amongst the many big data domains, genomics is considered the most demanding with respect to all stages of the data lifecycle, including acquisition, storage, distribution, and analysis [1]

  • The steps described above were taken to produce research object (RO), which were used to re-enact the workflows, without any further changes required. This demonstration illustrated the syntactic and semantic interoperability of the workflows across different systems. It shows that both Common Workflow Language (CWL) executors were able to exchange, comprehend, and use the information represented as CWLProv ROs

  • CWLProv and interoperability CWL already builds on technologies such as JavaScript Object Notation for Linked Data (JSON-LD) [99] for data modeling and Docker [26] to support portability of the runtime environments

Read more

Summary

Introduction

Amongst the many big data domains, genomics is considered the most demanding with respect to all stages of the data lifecycle, including acquisition, storage, distribution, and analysis [1]. Results: Based on best-practice recommendations identified from the literature on workflow design, sharing, and publishing, we define a hierarchical provenance framework to achieve uniformity in provenance and support comprehensive and fully re-executable workflows equipped with domain-specific information To realize this framework, we present CWLProv, a standard-based format to represent any workflow-based computational analysis to produce workflow output artefacts that satisfy the various levels of provenance. “Prospective provenance” refers to the “recipes” used to capture a set of computational tasks and their order, e.g., the workflow specification [16] This is typically given as an abstract representation of the steps (tools/data analysis steps) that are necessary to create a particular research output, e.g., a data artefact. Our focus is mainly on improving representation and capture of retrospective provenance

Objectives
Results
Discussion
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call