Abstract

Provenance of scientific workflows has been considered a means of providing workflow reproducibility. However, the provenance approaches adopted so far are not applicable in the context of the Cloud because the provenance trace lacks Cloud information. This paper presents a novel approach that collects Cloud-aware provenance and represents it as a graph. The reproducibility of a workflow execution on the Cloud is determined by comparing the workflow provenance at three levels: workflow structure, execution infrastructure and workflow outputs. The experimental evaluation shows that the implemented approach can detect changes in the provenance traces and in the outputs produced by the workflow.

Highlights

  • The scientific community is experiencing a data deluge due to the large amounts of data generated in modern scientific experiments, including projects such as the Large Hadron Collider (LHC) and N4U [1] [?]

  • In order to evaluate this aspect of workflow reproducibility, an algorithm has been proposed that compares the outputs produced by two given workflows

  • In order to evaluate the effect of Cloud configuration on workflow execution and to evaluate the comparison approaches proposed in ReCAP, three types of workflows from different scientific domains have been used
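The output comparison mentioned in the highlights can be illustrated with a minimal sketch. It assumes that outputs are compared by content hash (SHA-256 here) and that each run's outputs are held in a dictionary keyed by a logical file name. Both choices, and the names `digest` and `compare_outputs`, are assumptions made for illustration and are not taken from the paper.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of an output's contents."""
    return hashlib.sha256(data).hexdigest()


def compare_outputs(original: dict, reproduced: dict):
    """Compare the outputs of two workflow runs.

    Both arguments map a logical output name to its raw bytes
    (a hypothetical convention). Returns the number of outputs whose
    digests match and a sorted list of names that differ or are
    missing from one of the runs.
    """
    matched = 0
    differing = []
    for name in sorted(set(original) | set(reproduced)):
        a = original.get(name)
        b = reproduced.get(name)
        if a is not None and b is not None and digest(a) == digest(b):
            matched += 1
        else:
            differing.append(name)
    return matched, differing
```

Under this sketch, a reproduced execution would count as output-equivalent only when the list of differing names is empty.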


Summary

INTRODUCTION

The scientific community is experiencing a data deluge due to the large amounts of data generated in modern scientific experiments, including projects such as the Large Hadron Collider (LHC) and N4U [1] [?].

RECAP: REPRODUCE WORKFLOW EXECUTION USING CLOUD-AWARE PROVENANCE

ReCAP has been designed around a configuration- and plugin-based mechanism. With this mechanism, support for new workflow management systems, mapping algorithms, etc. can be added. The key aspects of ReCAP, such as the WMS components, the mapping algorithms, the persistence API that interacts with the workflow provenance, the ReCAP databases and the Cloud middleware, are driven by configuration parameters. These configurations (shown as ReCAP configs in Figure 4) are divided into seven main categories (see Table I). MIPS or KFLOPS are one way to specify the execution performance of a machine, which can in turn affect the execution performance of a job.
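A configuration- and plugin-based mechanism of this kind can be sketched as a small registry in which a configuration value selects the mapping algorithm to use. This is only an illustration of the general pattern: the registry, the `mapping.algorithm` key, the `static` plugin and the record fields (`id`, `ip`, `host_ip`) are all hypothetical and do not describe ReCAP's actual API.

```python
from typing import Callable, Dict, List

# Hypothetical registry: configuration value -> mapping function.
MAPPING_PLUGINS: Dict[str, Callable[[List[dict], List[dict]], Dict[str, str]]] = {}


def register(name: str):
    """Decorator that registers a mapping plugin under a config name."""
    def decorator(func):
        MAPPING_PLUGINS[name] = func
        return func
    return decorator


@register("static")
def static_mapping(jobs: List[dict], vms: List[dict]) -> Dict[str, str]:
    """Link each job to the VM whose IP address the job reported."""
    by_ip = {vm["ip"]: vm["id"] for vm in vms}
    return {
        job["id"]: by_ip[job["host_ip"]]
        for job in jobs
        if job["host_ip"] in by_ip
    }


def load_mapper(config: Dict[str, str]):
    """Pick the mapping plugin named in the configuration."""
    name = config["mapping.algorithm"]
    try:
        return MAPPING_PLUGINS[name]
    except KeyError:
        raise ValueError(f"no mapping plugin registered as {name!r}") from None
```

With such a design, supporting a new mapping algorithm would mean registering one more function and changing a configuration value, without touching the code that calls `load_mapper`.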

WMS Layer
WS Client
Cloud Layer
PROVENANCE COMPARISON
Workflow Graph Structure Comparison
Workflow Output Comparison
RESULTS AND ANALYSIS
Infrastructure Re-provisioning
Structure Analysis
Infrastructure Analysis
Workflow Output Analysis
Discussion
Workflow Reproducibility using ReCAP
CONCLUSIONS AND FUTURE WORK