Scientific Workflow Provenance Architecture for Heterogeneous HPC Environments

Alex Williams, Deepak K Tosh

doi:10.1109/iemcon53756.2021.9623106

Abstract

Provenance in computing systems is key to establishing data integrity. It provides a historical ledger of the data's life cycle through creation, ownership, consumption, and manipulation. With provenance in hand, it is possible to reverse engineer the state of the data, understand how it was derived, and verify its accuracy. This need for data integrity is especially critical in scientific workflows, where it ensures the verifiability and repeatability of derived results. Because of the vast computational power that scientific workflows require, many operate within high-performance computing (HPC) environments, where data is consumed and manipulated by a multitude of processes running on highly distributed infrastructure. The current landscape of HPC environments ranges from on-premises systems to cloud- and grid-based solutions. While the majority of research in digital provenance has focused on standalone HPC environments, provenance in a heterogeneous HPC environment remains a challenge. In this paper, we propose HyperProvenance, a high-level system architecture designed for next-generation heterogeneous HPC environments, which aims to increase confidence in the accuracy of workflow results through secure provenance collection.

Full Text