Applying content management to automated provenance capture

Karen L Schuchardt, Eric Stephan, Tara Gibson, George Chin

doi:10.1002/cpe.1230

Abstract

Workflows and data pipelines are becoming increasingly valuable to computational and experimental sciences. These automated systems can generate significantly more data in the same amount of time than their manual counterparts. Automatically capturing and recording data provenance and annotation as part of these workflows is critical for data management, verification, and dissemination. We have been prototyping a workflow provenance system, targeted at biological workflows, that extends our content management technologies and other open source tools. We applied this prototype to the provenance challenge to demonstrate an end‐to‐end system that supports dynamic provenance capture, persistent content management, and dynamic searches of both provenance and metadata. We describe our prototype, which extends the Kepler system for the execution environment, the Scientific Annotation Middleware (SAM) content management software for data services, and an existing HTTP‐based query protocol. Our implementation offers several unique capabilities and, through the use of standards, provides access to the provenance record from a variety of commonly available client tools. Copyright © 2007 John Wiley & Sons, Ltd.

Full Text