Abstract

In this era of Big Data, there is a growing need for scientific workflows to perform computations at a scale far exceeding the capabilities of a single workstation. When such data-intensive workflows run in a cloud distributed across several physical locations, execution time and resource-utilization efficiency depend heavily on the initial placement and distribution of the input datasets across the virtual machines. In this paper, we propose BDAP (Big DAta Placement strategy), a strategy that improves workflow performance by minimizing data movement across virtual machines. In this work, we 1) formalize the data placement problem in scientific workflows, 2) propose a data placement algorithm that considers both the initial input datasets and the intermediate datasets produced during workflow execution, and 3) perform extensive experiments in a distributed environment to verify that the proposed strategy places big datasets on appropriate virtual machines in the Cloud within reasonable time.
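The objective stated above can be illustrated with a minimal, hypothetical sketch (not the paper's actual BDAP formulation): given an assignment of datasets and tasks to virtual machines, measure how much data must be moved because a task and one of its input datasets reside on different VMs. All names and sizes below are assumptions made for illustration only.

```python
# Hypothetical example: estimate cross-VM data movement for a given placement.
# All identifiers and numbers are illustrative, not taken from the paper.

dataset_size = {"d1": 40, "d2": 25, "d3": 60}            # dataset sizes in GB
task_inputs = {"t1": ["d1", "d2"], "t2": ["d2", "d3"]}   # datasets each task reads
placement = {"d1": "vm1", "d2": "vm1", "d3": "vm2"}      # dataset -> VM holding it
task_vm = {"t1": "vm1", "t2": "vm2"}                     # task -> VM it runs on

def data_movement(placement, task_vm, task_inputs, dataset_size):
    """Total GB transferred because a task and one of its inputs are on different VMs."""
    moved = 0
    for task, inputs in task_inputs.items():
        for d in inputs:
            if placement[d] != task_vm[task]:
                moved += dataset_size[d]
    return moved

print(data_movement(placement, task_vm, task_inputs, dataset_size))  # 25 (d2 moves to vm2)
```

A placement strategy in this spirit would search over assignments of datasets to VMs so as to minimize this kind of transfer cost.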

Highlights

  • Workflows have been extensively employed in various scientific areas such as bioinformatics, physics, astronomy, ecology, and earthquake science [10]

  • They are usually modeled as directed acyclic graphs (DAGs) such that workflow tasks are represented as graph vertices and the data flows among tasks are represented by graph edges

  • To improve throughput and performance, this type of application can greatly benefit from distributed high-performance computing (HPC) infrastructures such as Cloud computing

Introduction

Workflows have been extensively employed in various scientific areas such as bioinformatics, physics, astronomy, ecology, and earthquake science [10]. They are usually modeled as directed acyclic graphs (DAGs) in which workflow tasks are represented as graph vertices and the data flows among tasks are represented by graph edges; the direction of an edge indicates the direction of the data flow between two tasks. A scientific workflow management system (SWFMS) is a system for designing and executing scientific workflows (SWF). Scientific workflows can be very large, comprising hundreds or thousands of complex tasks and big datasets [3, 6], and moving huge datasets between workflow tasks increases their execution time. To improve throughput and performance, this type of application can greatly benefit from distributed high-performance computing (HPC) infrastructures such as Cloud computing.
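As a minimal sketch of this DAG model (illustrative only; the task names and edges are assumptions, not taken from the paper), a workflow can be stored as a list of edges and its tasks ordered so that every task runs after the tasks whose data it consumes:

```python
from collections import defaultdict

# Hypothetical toy workflow: tasks are vertices, edges are data flows.
# Edge (u, v) means task u produces a dataset consumed by task v.
workflow_edges = [
    ("t1", "t2"),
    ("t1", "t3"),
    ("t2", "t4"),
    ("t3", "t4"),
]

def topological_order(edges):
    """Return the tasks in an order that respects all data dependencies."""
    successors = defaultdict(list)
    indegree = defaultdict(int)
    tasks = set()
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1
        tasks.update((u, v))
    ready = [t for t in tasks if indegree[t] == 0]
    order = []
    while ready:
        t = ready.pop()
        order.append(t)
        for s in successors[t]:
            indegree[s] -= 1
            if indegree[s] == 0:
                ready.append(s)
    return order

print(topological_order(workflow_edges))  # e.g. ['t1', 't2', 't3', 't4']
```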

