Abstract

In the big data era, scientific workflows have become increasingly data intensive and increasingly popular across scientific domains. Efficient scheduling of data‐intensive scientific workflows in a multiple datacenter (DC) environment has been a long‐standing challenge. Most previous work on data‐intensive scientific workflow scheduling has focused primarily on reducing the volume of data transferred between workflow tasks. In this paper, novel scheduling strategies for executing data‐intensive scientific workflows in a multi‐DC environment are proposed, aimed at optimizing the overall data transfer time. A novel DC selection approach is proposed to select a minimal set of DCs that has both enough storage capacity to execute the scientific workflow and optimized inter‐DC network bandwidth for efficient data transfer between workflow tasks. A k‐means clustering‐based data placement strategy is adopted to place the initial data of the scientific workflow, thereby reducing the volume of initial data transferred between different DCs. A multilevel task replication scheduling strategy is introduced to reduce the volume of intermediate data transferred between DCs while the scientific workflow is running. Simulations spanning a broad range of scientific workflows and multi‐DC settings are performed to verify the proposed approaches. The numerical results show that the combined scheduling strategy significantly reduces the overall data transfer time and data transfer volume when scientific workflows are scheduled in a multi‐DC environment. Copyright © 2015 John Wiley & Sons, Ltd.
