Integrating Provenance Data from Distributed Workflow Systems with ProvManager

Abstract

Running scientific workflows in distributed environments is motivating the definition of provenance gathering approaches that are loosely coupled to the workflow execution engine. This kind of approach is interesting because it allows both storage and access to provenance data in an integrated way, even in an environment where different workflow systems work together. Therefore, we have proposed a provenance gathering strategy that is independent of the workflow system technology. This strategy has evolved into a provenance management system named ProvManager. In this paper we show how provenance data is captured in a distributed execution environment with ProvManager, and we present its web interface, in which scientists can register experiments, monitor workflow execution, and query provenance data.

Similar Papers
  • Research Article
  • Cite Count Icon 3
  • 10.1016/s0164-1212(02)00127-9
GM-WTA: An efficient workflow task allocation method in a distributed execution environment
  • May 3, 2003
  • The Journal of Systems & Software
  • Jin Hyun Son + 4 more

  • Conference Article
  • Cite Count Icon 10
  • 10.1109/escience.2014.59
Towards an Adaptive and Distributed Architecture for Managing Workflow Provenance Data
  • Oct 1, 2014
  • Flavio Costa + 2 more

Workflow provenance data represents the workflow execution behavior, allowing for tracing the generation of the scientific data-flow. Provenance is an important asset to analyze data, identify and handle errors that occurred during the workflow execution through runtime monitoring. The workflow execution engine can also use provenance data to set the initial amount of resources and plan adaptive task scheduling. However, efficiently managing provenance data from distributed workflow execution has several challenges. As the size of workflows increases (in terms of number of activity executions or volume of data to process), the amount of provenance data to be managed also grows, especially in fine grain. Thus, centralized approaches become unviable. In this work we propose an architecture that combines distributed workflow management techniques with distributed provenance data management.

  • Conference Article
  • Cite Count Icon 11
  • 10.1109/escience.2015.50
Data Analytics in Bioinformatics: Data Science in Practice for Genomics Analysis Workflows
  • Aug 1, 2015
  • Kary A C S Ocaña + 3 more

Workflow systems manage large-scale experiments and deliver a large volume of provenance data traces. The provenance repository of these systems contains information about the workflow execution, which allows for tracking and analyzing data transformations. However, provenance data may still be considered a black box when it comes to analyzing the contents of resulting data files. Current solutions focus on data transformation at coarse grain: they point to input and output files but do not allow for exploring domain-specific data. Data analytics is essential for managing large-scale workflows executed in parallel, especially when tracking anomalous executions. In this paper, we present a data analytics approach based on the use of provenance data enriched with domain-specific data and coupled to a data mining tool. A real bioinformatics workflow was modeled and executed in parallel on top of Amazon clouds. It manipulates complex biological data, which, as in many other genomic workflows, is difficult to monitor. We evaluate the benefits of using domain-specific data and provenance data for user steering while monitoring the execution with detailed filters, steering on specific conditions, and performance evaluation. Results show that the provenance database coupled to workflow systems has an unexplored potential for raw data analytics, which may improve user confidence and reduce overall execution time.

  • Research Article
  • 10.14257/ijmue.2015.10.2.13
Research of the Interconnection of Workflow System Based on Web Service
  • Feb 28, 2015
  • International Journal of Multimedia and Ubiquitous Engineering
  • Gang Yuan + 3 more

To achieve interconnection between different workflow management systems, this paper proposes encapsulating all the distributed workflow systems as web services so that they perform the entire business process collaboratively through process composition. By comparing the composition of processes with that of ordinary Web services, we studied interactive control, the parameters that must be passed between the distributed workflow systems, and the workflow system service's interfaces and packaging. Furthermore, we put forward a general method for extending the interactive interfaces of workflow systems and a way of encapsulating and invoking the workflow service. With this approach, processes or process fragments that are deployed on different workflow systems can easily be combined without other agents and components. The approach also supports the interconnection of workflow systems in a distributed environment and ultimately achieves coordinated operation between different workflow engines.

  • Book Chapter
  • Cite Count Icon 15
  • 10.1007/978-3-642-34222-6_12
Using Domain-Specific Data to Enhance Scientific Workflow Steering Queries
  • Jan 1, 2012
  • João Carlos De A.R Gonçalves + 4 more

In scientific workflows, provenance data helps scientists in understanding, evaluating and reproducing their results. Provenance data generated at runtime can also support workflow steering mechanisms. Steering facilities for workflows is considered a challenge due to its dynamic demands during execution. To steer, for example, scientists should be able to suspend (or stop) a workflow execution when the approximate solution meets (or deviates) preset criteria. These criteria are commonly evaluated based on provenance data (execution data) and domain-specific data. We claim that the final decision on whether to interfere on the workflow execution may only become feasible when workflows can be steered by scientists using provenance data enriched with domain-specific data. In this paper we propose an approach based on specialized software components, named Data Extractor (DE), to acquire domain-specific data from data files produced during a scientific workflow execution. DE gathers domain-specific data from produced data files and associates it to existing provenance data on the provenance repository. We have evaluated the proposed approach using a real bioinformatics workflow for comparative genomics executed in SciCumulus cloud workflow parallel engine.

  • Research Article
  • Cite Count Icon 5
  • 10.3233/978-1-61499-054-3-91
Provenance for distributed biomedical workflow execution.
  • Jan 1, 2012
  • Studies in health technology and informatics
  • Souley Madougou + 7 more

Scientific research has become very data and compute intensive because of the progress in data acquisition and measurement devices, which is particularly true in Life Sciences. To cope with this deluge of data, scientists use distributed computing and storage infrastructures. The use of such infrastructures introduces by itself new challenges to the scientists in terms of proper and efficient use. Scientific workflow management systems play an important role in facilitating the use of the infrastructure by hiding some of its complexity. Although most scientific workflow management systems are provenance-aware, not all of them come with provenance functionality out of the box. In this paper we describe the improvement and integration of a provenance system into an e-infrastructure for biomedical research based on the MOTEUR workflow management system. The main contributions of the paper are: presenting an OPM implementation using relational database backend for the provenance store, providing an e-infrastructure with a comprehensive provenance system, defining a generic approach to provenance implementation, potentially suitable for other workflow systems and application domains and demonstrating the value of this system based on use cases presenting the provenance data through a user-friendly web interface.

  • Research Article
  • Cite Count Icon 2
  • 10.1002/cpe.1459
Special Issue: 3rd International Workshop on Workflow Management and Applications in Grid Environments (WaGe2008)
  • Jun 24, 2009
  • Concurrency and Computation: Practice and Experience
  • Jinjun Chen + 1 more

This special issue of Concurrency and Computation: Practice and Experience contains selected high-quality papers from the 3rd International Workshop on Workflow Management and Applications in Grid Environments (WaGe2008), which was held on May 25, 2008, in Kunming, China [1]. The WaGe workshop series aims to provide an international forum for the presentation and discussion of research and development trends regarding workflow support in grid environments. WaGe2008 attracted many international attendees, allowing deep discussion and the exchange of ideas and results related to ongoing research. Following WaGe2007 on August 17, 2007, in Urumqi, China, WaGe2008 continues to discuss workflow management in grid environments from different perspectives and areas, exploring the potential for further research and development.

Grid workflow has been under investigation for several years [2-8]. In particular, the special issue titled Workflow in Grid Systems in Concurrency and Computation: Practice and Experience was a key step [7]. That special issue was edited by Professor Geoffrey C. Fox and Professor Dennis Gannon from Indiana University in the U.S.A. It was followed by special issues in the same journal for WSGE2006 (1st International Workshop on Workflow Systems in Grid Environments) and WaGe2007 [9, 10]. This WaGe2008 special issue is a further follow-up to those three special issues, intended to boost the research and development of workflow management and applications in grid environments. Many research and development efforts have been made in this field, such as [2-8, 11-19]. More and more people from different areas are trying to apply techniques from their respective areas to tackle tough issues in grid workflow such as resource scheduling, computation reduction, and semantic/knowledge management.

Accordingly, following the special issue of WaGe2007, this special issue continues to accommodate a range of papers from different perspectives and areas, such as service computing and knowledge management, to provide different views and hints for grid workflow research. This special issue contains ten papers based on those that were presented at WaGe2008. They are listed as [20-29]. Research problems in these papers have been analysed systematically, and for specific approaches or models, evaluation has been performed to demonstrate their feasibility and advantages. The ten papers were selected on this basis and were also thoroughly peer reviewed. They are summarized below.

Paper [20] is related to Pegasus, a grid/scientific workflow management system. The paper provides an extension to Pegasus whereby resource allocation decisions are revised during workflow evaluation so that adaptive processing can be achieved. An experimental evaluation is conducted to demonstrate the feasibility and performance of the proposed algorithm. Wang et al. [21] focus on robustness and reliability in grid/scientific workflow scheduling and execution. From the perspective of agent technology, the paper introduces a model that incorporates trust, which indicates the probability that a service agent will comply with its commitments, to improve the predictability and stability of the schedule. The evaluation demonstrates that the proposed model can determine the most robust execution flow efficiently, thus avoiding the need for scheduling every possible execution path in the workflow definition. Kelly et al. [22] propose a new workflow model based on Lambda Calculus to accommodate the fact that scientific applications often need to execute a set of dependent tasks across multiple computers. A comparison with other workflow languages and a prototype implementation are presented to demonstrate its better performance. Luo et al. [23] view grid/scientific workflow from a knowledge flow perspective. The paper proposes to represent and reason about the similarity of knowledge flow in order to facilitate grid/scientific workflow in semantic e-science applications. A set of representations is given, and an evaluation is conducted to demonstrate the feasibility and applicability of the proposed ideas and algorithms in e-science applications. Ren et al. [24] propose to set up a quick service query list, rather than using an ontology, to find a satisfactory service quickly for executing grid/scientific workflow. Differing from the traditional ontology approaches, the proposed query list can improve query efficiency significantly for grid/scientific workflow execution. The evaluation further demonstrates this conclusion.

Goderis et al. [25] address the discovery issue in sharing grid/scientific workflow specification and execution. The paper develops benchmarks for the evaluation of discovery tools, drawing on a series of practical exercises. Finally, the paper demonstrates the value of the benchmarks on two tools: one using graph matching and the other relying on text clustering. He et al. [26] propose a time-computational model named TCMAC for scientific workflow design and scheduling. The paper views grid/scientific workflow from a computation perspective and proposes a mathematical time model to reason about the scheduling of grid/scientific workflow. Theoretical analysis and examples are presented to demonstrate the feasibility and performance of TCMAC. Liu and Zhou [27] propose an integrated time model for distributed workflow management in grid environments. The paper views grid workflow from the perspective of multi-granularity of time and time zone difference. Correspondingly, the proposed model can accommodate distributed workflow execution in different time zones in grid environments to provide better execution efficiency using less time. A case study is presented to demonstrate the feasibility and performance of the proposed model. Ren and Chen [28] focus on optimizing the execution of grid/scientific workflow with the aim of satisfying QoS constraints. The paper proposes a reverse order-based approach to gradually delete QoS constraint violations for building an optimized path to execute a scientific workflow, so that stopping the workflow execution can be avoided. An evaluation is conducted to demonstrate the feasibility and performance of the proposed approach. Pandey et al. [29] present a real-world grid workflow application in the brain science area. A brain imaging analysis grid workflow is proposed covering processing of Image Registration (IR) for Functional Magnetic Resonance Imaging (fMRI) studies on Global Grids. The paper then discusses benchmarking of the application on the Grid'5000 platform to demonstrate a real-world deployment of the proposed grid workflow and presents extensive performance results.

  • Conference Article
  • Cite Count Icon 35
  • 10.1145/1646468.1646470
Exploring many task computing in scientific workflows
  • Nov 16, 2009
  • Eduardo Ogasawara + 7 more

One of the main advantages of using a scientific workflow management system (SWfMS) to orchestrate data flows among scientific activities is to control and register the whole workflow execution. The execution of activities within a workflow with high performance computing (HPC) presents challenges in SWfMS execution control. Current solutions leave the scheduling to the HPC queue system. Since the workflow execution engine does not run on remote clusters, SWfMS are not aware of the parallel strategy of the workflow execution. Consequently, remote execution control and provenance registry of the parallel activities are very limited from the SWfMS side. This work presents a set of components to be included in the workflow specification of any SWfMS to control parallelization of activities as many-task computing (MTC). In addition, these components can gather provenance data during remote workflow execution. Through these MTC components, the parallelization strategy can be registered and reused, and provenance data can be uniformly queried. We have evaluated our approach by performing parameter sweep parallelization in solving the incompressible 3D Navier-Stokes equations. Experimental results show the performance gains with the additional benefits of distributed provenance support.

  • Research Article
  • Cite Count Icon 34
  • 10.1002/cpe.1870
ProvManager: a provenance management system for scientific workflows
  • Oct 10, 2011
  • Concurrency and Computation: Practice and Experience
  • Anderson Marinho + 6 more

Summary: Running scientific workflows in distributed and heterogeneous environments has been a motivating approach for provenance management, which is loosely coupled to the workflow execution engine. This kind of approach is interesting because it allows both storage and access to provenance data in a homogeneous way, even in an environment where different workflow management systems work together. However, current approaches overload scientists with many ad hoc tasks, such as script adaptations and implementations of extra functionalities to provide provenance independence. This paper proposes ProvManager, a provenance management approach that eases the gathering, storage, and analysis of provenance information in a distributed and heterogeneous environment scenario, without putting the burden of adaptations on the scientist. ProvManager leverages the provenance management at the experiment level by integrating different workflow executions from multiple workflow management systems. Copyright © 2011 John Wiley & Sons, Ltd.

  • Supplementary Content
  • Cite Count Icon 10
  • 10.5167/uzh-73128
An event- and repository-based component framework for workflow system architecture
  • Jan 1, 1999
  • Zurich Open Repository and Archive (University of Zurich)
  • Dimitrios Tombros

During the past decade a new class of systems has emerged, which plays an important role in the support of efficient business process implementation: workflow systems. Despite their proliferation however, workflow systems are still being developed in an ad hoc way without making use of advanced software engineering technologies such as component-based system development and reuse of architecture artifacts.

This work proposes a modern approach to workflow system construction. The approach is centered around a domain-specific software architecture metamodel (the REWORK metamodel) and a repository-based composition framework for workflow system construction out of reusable reactive components. The architecture metamodel defines the component and connector abstractions necessary for describing the static and dynamic aspects of a workflow system. The composition framework defines the lifecycle of a workflow system and supports the dynamic extension of a kernel workflow management system with application-specific elements. Appropriately, resulting systems are called REWORK systems.

An event- and repository-based style underlies the REWORK framework. Events are the only component integration mechanism used in REWORK systems. Repositories support both system development by storing artifacts which are used for workflow system development and system operation by making explicit the structure of a running REWORK system.

The iterative workflow system composition lifecycle proposed in this thesis comprises the following phases: the architecture analysis phase allows the identification and characterization of processing entities which participate in workflow execution; this phase is supported by a classification framework for processing entities in accordance to their integration-related properties. 
During the architecture definition phase workflow system components are defined and their behavior is tailored to specifications of workflows which are intended to be executed by the resulting system; furthermore, organizational relations and task assignment policies for these components are declaratively defined. The implementation phase is largely automated and consists in the instantiation of the defined components on top of an event-based operational infrastructure.

As already mentioned the entire lifecycle is supported by repositories which store the workflow system artifacts. The iterative development comes into the picture once existing workflow systems have to be maintained either by adding new repository artifacts or by modifying existing ones. Thus, we dedicate a part of this thesis to the description of these repositories.

  • Conference Article
  • 10.1109/icist.2013.6747579
DCT sign-based robust image hashing
  • Dec 1, 2013
  • Supakorn Prungsinchai + 2 more

With the development of information services and the rapid expansion of cloud computing, scientific workflow systems face the challenges of growing volumes of heterogeneous data, the complexity of scientific computing, and the difficulty of task integration. In this paper, a cloud scientific workflow system, CSWf, based on Hadoop, NoSQL and Web Service technology, is proposed to effectively integrate data and service resources in a loosely coupled cloud service environment. To implement CSWf, a simple but effective cloud service workflow modeling language is designed, a reliable workflow engine is developed to parse and schedule the workflow processes, and a distributed execution framework is built to encapsulate workflow jobs and execute workflows in the cloud environment. CSWf benefits from the massive data storage capacity and distributed parallel computing power of cloud computing. It accommodates the requirements of modeling, scheduling, coordinating and executing workflows on a distributed workflow system. At the end of this paper, an urban regional air pollution workflow is run on the cluster with input data of different sizes to measure the performance of CSWf. The results show that CSWf can significantly improve the efficiency of workflow execution.

  • Research Article
  • Cite Count Icon 11
  • 10.1016/j.future.2013.04.019
Characterizing workflow-based activity on a production e-infrastructure using provenance data
  • May 2, 2013
  • Future Generation Computer Systems
  • Souley Madougou + 6 more

  • Research Article
  • Cite Count Icon 4
  • 10.1002/cpe.3733
Multi‐layered simulations at the heart of workflow enactment on clouds
  • Dec 16, 2015
  • Concurrency and Computation: Practice and Experience
  • Simon Ostermann + 2 more

Summary: Scientific workflow systems face new challenges when supporting Cloud computing, as the information on the state of the used infrastructures is much less detailed than before. Thus, organising virtual infrastructures in a way that not only supports the workflow execution but also optimises it for several service level objectives (e.g. maximum energy consumption limit, cost, reliability, availability) becomes reliant on good Cloud modelling and prediction information. While simulators have successfully aided research on such workflow management systems, the currently available Cloud-related simulation toolkits suffer from several issues (e.g. scalability and narrow scope) that hinder their applicability. To address these issues, this article introduces techniques for unifying two existing simulation toolkits by first analysing the problems with the current simulators, and then by illustrating the problems faced by workflow systems. We use for this purpose the example of the ASKALON environment, a scientific workflow composition and execution tool for cloud and grid environments. We illustrate the advantages of a workflow system with a directly integrated simulation back‐end and how the unification of the selected simulators does not affect the overall workflow execution simulation performance. Copyright © 2015 John Wiley & Sons, Ltd.

  • Conference Article
  • Cite Count Icon 35
  • 10.1145/967900.968040
Architectures for a temporal workflow management system
  • Mar 14, 2004
  • Carlo Combi + 1 more

Workflows describe business processes as the coordinated execution of simple activities (tasks) by human or automatic executors (agents). Workflow management systems (WfMS) are software systems supporting the automatic execution of workflows. Most WfMSs rely on database management systems (DBMS) where temporal aspects, which are relevant for the execution of a workflow, are managed explicitly. In this paper we discuss different architectures for a temporal WfMS: then we propose yet another workflow system which novelly manages temporal aspects via a temporal database system, composed by a temporal layer on top of a relational DBMS (Oracle).
The adoption of a temporal database system both benefitted the development of the engine and increased its efficiency by allowing some additional features, such as the management of process model evolution and the selection of executing agents via a workload balance over time.

  • Research Article
  • Cite Count Icon 28
  • 10.1371/journal.pone.0309210
Recording provenance of workflow runs with RO-Crate
  • Sep 10, 2024
  • PLOS ONE
  • Simone Leo + 17 more

Recording the provenance of scientific computation results is key to the support of traceability, reproducibility and quality assessment of data products. Several data models have been explored to address this need, providing representations of workflow plans and their executions as well as means of packaging the resulting information for archiving and sharing. However, existing approaches tend to lack interoperable adoption across workflow management systems. In this work we present Workflow Run RO-Crate, an extension of RO-Crate (Research Object Crate) and Schema.org to capture the provenance of the execution of computational workflows at different levels of granularity and bundle together all their associated objects (inputs, outputs, code, etc.). The model is supported by a diverse, open community that runs regular meetings, discussing development, maintenance and adoption aspects. Workflow Run RO-Crate is already implemented by several workflow management systems, allowing interoperable comparisons between workflow runs from heterogeneous systems. We describe the model, its alignment to standards such as W3C PROV, and its implementation in six workflow systems. Finally, we illustrate the application of Workflow Run RO-Crate in two use cases of machine learning in the digital image analysis domain.
