Abstract

This paper focuses on data-intensive workflows and addresses the problem of scheduling workflow ensembles under cost and deadline constraints in Infrastructure as a Service (IaaS) clouds. Previous research in this area ignores file transfers between workflow tasks, which, as we show, often have a large impact on workflow ensemble execution. In this paper we propose and implement a simulation model for handling file transfers between tasks, featuring the ability to dynamically calculate bandwidth and supporting a configurable number of replicas, thus allowing us to simulate various levels of congestion. The resulting model is capable of representing a wide range of storage systems available on clouds: from in-memory caches (such as memcached) to distributed file systems (such as NFS servers) and cloud storage (such as Amazon S3 or Google Cloud Storage). We observe that file transfers may have a significant impact on ensemble execution; for some applications, up to 90% of the execution time is spent on file transfers. Next, we propose and evaluate a novel scheduling algorithm that minimizes the number of transfers by taking advantage of data caching and file locality. We find that for data-intensive applications it performs better than other scheduling algorithms. Additionally, we modify the original scheduling algorithms to operate effectively in environments where file transfers take non-zero time.
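To illustrate the bandwidth-sharing idea mentioned above, the sketch below is a minimal example (in Python, with hypothetical names such as effective_bandwidth and replica_bandwidth_mbps; it is not the paper's actual simulator) of dividing a storage system's aggregate bandwidth among concurrent transfers, so that more simultaneous transfers mean more congestion while additional replicas add capacity.

```python
# Minimal sketch (not the paper's simulator): bandwidth is shared among
# concurrent transfers, and each storage replica contributes capacity.

def effective_bandwidth(replica_bandwidth_mbps, num_replicas, active_transfers):
    """Per-transfer bandwidth when 'active_transfers' reads proceed in parallel.

    Total capacity grows with the number of replicas; congestion grows with
    the number of concurrent transfers sharing that capacity.
    """
    if active_transfers == 0:
        return 0.0
    total = replica_bandwidth_mbps * num_replicas
    return total / active_transfers

def transfer_time(file_size_mb, replica_bandwidth_mbps, num_replicas, active_transfers):
    """Estimated time (seconds) to move one file under the current congestion level."""
    bw = effective_bandwidth(replica_bandwidth_mbps, num_replicas, active_transfers)
    return float('inf') if bw == 0 else (file_size_mb * 8) / bw

# Example: a 100 MB file, 3 replicas of a 1000 Mbps store, 20 parallel reads.
print(transfer_time(100, 1000, 3, 20))  # ~5.3 s, versus ~0.27 s with a single reader
```

Under these assumptions, a transfer's duration depends on how many other transfers are in flight, which is what lets a simulation reproduce different levels of congestion.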

Highlights

  • Today, workflows are frequently used to model large-scale distributed scientific applications

  • We analyze the relative performance of our proposed scheduling algorithms on clouds with different storage system configurations

  • The experiment consists of 100 simulations (10 budgets × 10 deadlines) for each application ensemble of 50 workflows (5 applications) and each scheduling algorithm variant (7 algorithms)

Summary

Introduction

Workflows are frequently used to model large-scale distributed scientific applications. By using scientific workflows, multiple researchers can collaborate on designing a single distributed application, because workflows are arranged as directed acyclic graphs (DAGs) in which each node is a standalone task and edges represent dependencies between tasks. Pegasus [21], which is used in a number of scientific domains such as astronomy and bioinformatics, is a system that can execute workflows on desktops, clusters, grids, or clouds. Such execution is a non-trivial task, especially on clouds, where resource provisioning and deprovisioning, cost accounting, and resource setup must be taken into account. Large-scale computations are often composed of several interrelated workflows grouped into ensembles, consisting of workflows that have a similar structure but may differ in their input data, number of tasks, and individual task sizes.
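To make the DAG and ensemble terminology concrete, here is a minimal illustrative sketch (the names Task and Workflow are our own and not Pegasus APIs): a workflow is a DAG of tasks joined by file dependencies, and an ensemble is simply a collection of structurally similar workflows.

```python
# Illustrative only: a workflow as a DAG of tasks joined by file dependencies,
# and an ensemble as a collection of similar workflows differing in size/input data.
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    runtime_s: float
    inputs: set = field(default_factory=set)   # file names read by this task
    outputs: set = field(default_factory=set)  # file names produced by this task

@dataclass
class Workflow:
    tasks: list

    def edges(self):
        """Dependency edges: task B depends on task A if B reads a file A writes."""
        for a in self.tasks:
            for b in self.tasks:
                if a is not b and a.outputs & b.inputs:
                    yield (a.name, b.name)

# A tiny two-task workflow: 'preprocess' feeds 'analyze' through f1.dat.
wf = Workflow([
    Task("preprocess", 30.0, inputs={"raw.dat"}, outputs={"f1.dat"}),
    Task("analyze", 120.0, inputs={"f1.dat"}, outputs={"result.dat"}),
])
ensemble = [wf]          # an ensemble is a list of such workflows
print(list(wf.edges()))  # [('preprocess', 'analyze')]
```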
