Abstract

Scientific workflows benefit from the cloud computing paradigm, which offers access to virtual resources provisioned on a pay-as-you-go, on-demand basis. Minimizing resource costs to meet the user's budget is very important in a cloud environment. Several optimization approaches have been proposed to improve the performance and cost of data-intensive scientific Workflow Scheduling (DiSWS) in cloud computing. However, the majority of DiSWS approaches in the literature focus on heuristics and metaheuristics as optimization methods. Furthermore, the task hierarchy in data-intensive scientific workflows has not been extensively explored in the current literature. In this paper, a data-intensive scientific workflow is represented as a hierarchy that specifies the hierarchical relations between workflow tasks, and an approach for scheduling data-intensive workflow applications is proposed. In this approach, the datasets and workflow tasks are first modeled as a conditional probability matrix (CPM). Second, several data transformations and hierarchical clustering are applied to the CPM structure to determine the minimum number of virtual machines needed for the workflow execution, where the hierarchical clustering is performed with respect to the budget imposed by the user. After the data transformation and hierarchical clustering, the amount of data transmitted between clusters is reduced, which improves the cost and makespan of the workflow by optimizing the use of virtual resources and network bandwidth. The performance and cost are analyzed using an extension of the CloudSim simulation tool and compared with existing multi-objective approaches. The results demonstrate that our approach reduces resource costs while respecting user budgets.
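
The pipeline described above can be illustrated with a minimal sketch. The code below assumes a hypothetical task/dataset usage matrix; the CPM construction, the distance measure, and the budget-driven choice of the cluster count are simplified placeholders for illustration, not the paper's exact formulation.

```python
# Minimal sketch of the clustering pipeline described in the abstract.
# The CPM construction, distance measure, and budget-driven cluster count
# are simplified placeholders, not the paper's exact formulation.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical usage matrix: rows = tasks, columns = datasets,
# entry (i, j) = 1 if task i reads or writes dataset j.
usage = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
], dtype=float)

# Conditional-probability-style matrix (CPM): entry (i, j) estimates the
# probability that task j shares a dataset used by task i.
co_usage = usage @ usage.T                       # shared-dataset counts
cpm = co_usage / usage.sum(axis=1, keepdims=True)

# Turn the (asymmetric) similarity into a symmetric distance matrix.
similarity = (cpm + cpm.T) / 2.0
distance = 1.0 - similarity
np.fill_diagonal(distance, 0.0)

# Agglomerative (hierarchical) clustering over the task distances.
linkage_matrix = linkage(squareform(distance, checks=False), method="average")

# Budget-driven choice of the number of clusters (i.e. VMs): a real
# implementation would derive this bound from VM prices and the user budget.
max_vms_within_budget = 2
labels = fcluster(linkage_matrix, t=max_vms_within_budget, criterion="maxclust")
print("task -> VM cluster:", labels)
```

Tasks assigned to the same cluster would be co-located on one virtual machine, so only inter-cluster dependencies generate network transfers; this is the mechanism by which the clustering reduces both transfer cost and makespan.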

Highlights

  • In recent years, cloud environments have been increasingly used in the scientific field [1]

  • We focus on the following research question: What is the number of virtual machines required for the efficient and transparent execution of a workflow in a cloud environment?

  • Our work aims to reduce the monetary cost of data movements during workflow execution and to improve network utilization in the cloud environment

Introduction

In recent years, cloud environments have been increasingly used in the scientific field [1]. The majority of workflow scheduling approaches employ heuristics and meta-heuristics as optimization methods and focus only on execution time [7]; even then, communication among tasks is often assumed to take zero time. Traditional techniques have examined data sharing among workflow tasks, and these techniques for scheduling scientific workflow tasks inspired the development of our approach. We propose a novel approach for workflow scheduling that considers the hierarchy of scientific workflow tasks. We focus on the following research question: What is the number of virtual machines required for the efficient and transparent execution of a workflow in a cloud environment?
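
As a point of contrast with the zero-communication assumption mentioned above, the following sketch estimates the delay and monetary cost of moving an intermediate dataset between two tasks placed on different virtual machines. The bandwidth and per-GB price values are illustrative assumptions, not figures from the paper.

```python
# Illustrative estimate of inter-VM data movement overhead; the bandwidth
# and price figures are assumptions for the example, not values from the paper.
def transfer_time_seconds(data_gb: float, bandwidth_gbps: float) -> float:
    """Time to move `data_gb` gigabytes over a link of `bandwidth_gbps` Gbit/s."""
    return (data_gb * 8.0) / bandwidth_gbps

def transfer_cost_dollars(data_gb: float, price_per_gb: float) -> float:
    """Monetary cost of the same movement under a per-GB transfer price."""
    return data_gb * price_per_gb

if __name__ == "__main__":
    data_gb = 50.0                      # hypothetical intermediate dataset
    time_s = transfer_time_seconds(data_gb, bandwidth_gbps=1.0)
    cost = transfer_cost_dollars(data_gb, price_per_gb=0.09)
    print(f"transfer: {time_s:.0f} s, ${cost:.2f}")
    # If the producing and consuming tasks are co-located on one VM,
    # both the delay and the cost above drop to approximately zero.
```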

Related Works
Problem Statement
Application Model
Execution Model
Data Transfer Model
Proposed Approach
Task Clustering Based on Conditional Probability
Transforming Data
Tasks Distance Measures
Cluster Analysis Method and VMs Number Interval
Hierarchical Clustering Method
Objective Function
Evaluation Methods
Experiment 1
Experiment 2
Experiment 3
Conclusion