Abstract

Scientific workflows are increasingly used in High Performance Computing (HPC) environments to manage complex simulations and analyses, often consuming and generating large amounts of data. However, workflow tools have limited support for managing input, output and intermediate data. The data elements of a workflow are often managed by the user through scripts or other ad hoc mechanisms. Technology advances for future HPC systems are redefining the memory and storage subsystem by introducing additional tiers to improve the I/O performance of data-intensive applications. These architectural changes add complexity to managing data for scientific workflows. Thus, we need to manage scientific workflow data across the tiered storage system on HPC machines. In this paper, we present the design and implementation of MaDaTS (Managing Data on Tiered Storage for Scientific Workflows), a software architecture that manages data for scientific workflows. We introduce the Virtual Data Space (VDS), an abstraction of the data in a workflow that hides the complexities of the underlying storage system while allowing users to control data management strategies. We evaluate the data management strategies with real scientific and synthetic workflows, and demonstrate the capabilities of MaDaTS. Our experiments demonstrate the flexibility, performance and scalability gains of MaDaTS compared to the traditional approach of managing data in scientific workflows.

Highlights

  • Scientific workflows are processing large amounts of data through complex simulation and analysis tasks

  • MaDaTS is built on top of an abstraction called Virtual Data Space (VDS) that hides the complexities of managing data on tiered storage systems

  • VDS is a collection of virtual data objects

Summary

Scientific workflows process large amounts of data through complex simulation and analysis tasks. The need to minimize I/O costs on next-generation systems and the emergence of new technologies (NVRAM, SSDs, etc.) are resulting in deeper storage hierarchies on High Performance Computing (HPC) systems. A multi-tiered storage hierarchy introduces complexities in workflow and data management. There is a need for simple and flexible data abstractions that allow users to seamlessly manage workflow data and tasks on HPC systems with multiple storage tiers. MaDaTS (Managing Data on Tiered Storage for Scientific Workflows) provides an API and a command-line tool that allow users to manage their workflows and data on tiered storage (Ghoshal & Ramakrishnan, 2017).
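To make the idea of a data abstraction over storage tiers concrete, the following is a minimal illustrative sketch in Python. It is not the MaDaTS API: the class names, method names and tier labels ("disk", "burst_buffer") are all hypothetical and chosen only to show how a collection of virtual data objects might record which tier each data element lives on.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class VirtualDataObject:
    """Hypothetical stand-in for one workflow data element."""
    name: str
    size_bytes: int
    tier: str  # hypothetical tier label, e.g. "disk" or "burst_buffer"


@dataclass
class VirtualDataSpace:
    """Hypothetical collection of virtual data objects (not the MaDaTS API)."""
    objects: Dict[str, VirtualDataObject] = field(default_factory=dict)

    def add(self, name: str, size_bytes: int, tier: str = "disk") -> VirtualDataObject:
        # Register a data element and the tier it currently lives on.
        obj = VirtualDataObject(name, size_bytes, tier)
        self.objects[name] = obj
        return obj

    def move(self, name: str, tier: str) -> None:
        # Only the bookkeeping is updated here; a real system would also copy the data.
        self.objects[name].tier = tier

    def on_tier(self, tier: str) -> List[str]:
        # Names of all objects currently assigned to the given tier, sorted.
        return sorted(n for n, o in self.objects.items() if o.tier == tier)


vds = VirtualDataSpace()
vds.add("input.dat", 4_000_000_000)
vds.add("intermediate.h5", 500_000_000)
vds.move("intermediate.h5", "burst_buffer")
```

In this sketch the user works with object names only, while the tier assignment is tracked behind the abstraction, which is the kind of separation the summary describes.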

MaDaTS Workflow Execution
Data Management Abstractions
