Abstract

Most large-scale scientific workflows execute across multiple collaborating datacenters to access community-wide resources, while adhering to each datacenter's non-uniform resource limits. However, moving both initial input datasets with predetermined locations and intermediate datasets that require placement decisions across geo-distributed datacenters hinders the efficient execution of large-scale data-intensive scientific workflows. Data and task co-scheduling for such workflows must therefore handle pre-placed initial input datasets, the placement of intermediate datasets, and each datacenter's non-uniform computation and storage constraints, while minimizing cross-datacenter data transfer. Since this scheduling problem is known to be NP-hard, we propose a novel approach based on the multilevel graph coarsening and uncoarsening framework, combined with a specialized hybrid genetic algorithm whose distinctive, graph-partition-driven repair and local-improvement operators schedule data-intensive scientific workflows in geo-distributed datacenters while optimizing the cross-datacenter data transfer volume. Extensive simulations on four real-world workflow traces show that our algorithm significantly reduces overall geo-distributed data transfer, demonstrating its effectiveness.
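To make the abstract's core idea concrete, the sketch below models task placement across datacenters as a capacity-constrained graph partitioning problem and searches it with a simple genetic algorithm that repairs infeasible assignments and greedily improves offspring, mirroring the repair and local-improvement operators described above. This is a hypothetical toy, not the authors' implementation: the multilevel coarsening/uncoarsening phase is omitted, and all names, dataset sizes, and capacities are invented for illustration.

```python
# Toy sketch (assumed, not the paper's algorithm): chromosome = task-to-datacenter
# assignment; fitness = cross-datacenter transfer volume; capacity violations are
# repaired and offspring are locally improved by single-task moves.
import random

random.seed(0)

# Hypothetical workflow: edges are (producer_task, consumer_task, dataset_size_GB).
EDGES = [(0, 2, 4.0), (1, 2, 2.0), (2, 3, 6.0), (2, 4, 1.5), (3, 5, 3.0), (4, 5, 3.0)]
N_TASKS = 6
CAPACITY = [3, 3]          # per-datacenter task limit, a stand-in for compute/storage caps
N_DCS = len(CAPACITY)

def transfer_volume(assign):
    """Total data moved across datacenters under a given task placement."""
    return sum(size for u, v, size in EDGES if assign[u] != assign[v])

def repair(assign):
    """Move tasks out of over-full datacenters into the least-loaded one."""
    load = [assign.count(d) for d in range(N_DCS)]
    for t in range(N_TASKS):
        d = assign[t]
        if load[d] > CAPACITY[d]:
            tgt = min(range(N_DCS), key=lambda x: load[x])
            load[d] -= 1; load[tgt] += 1; assign[t] = tgt
    return assign

def local_improve(assign):
    """Greedy single-task relocations that cut transfer without breaking capacity."""
    for t in range(N_TASKS):
        best_d, best_cost = assign[t], transfer_volume(assign)
        for d in range(N_DCS):
            if d != assign[t] and assign.count(d) < CAPACITY[d]:
                trial = assign[:]; trial[t] = d
                cost = transfer_volume(trial)
                if cost < best_cost:
                    best_d, best_cost = d, cost
        assign[t] = best_d
    return assign

def evolve(pop_size=20, gens=50):
    """Hybrid GA: elitist selection, one-point crossover, mutation, repair, local search."""
    pop = [repair([random.randrange(N_DCS) for _ in range(N_TASKS)]) for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=transfer_volume)
        parents = pop[: pop_size // 2]
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, N_TASKS)          # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < 0.2:                   # mutation
                child[random.randrange(N_TASKS)] = random.randrange(N_DCS)
            children.append(local_improve(repair(child)))
        pop = parents + children
    return min(pop, key=transfer_volume)

best = evolve()
print("placement:", best, "cross-DC transfer (GB):", transfer_volume(best))
```

In the paper's full method, a multilevel scheme would first coarsen the workflow graph, solve the placement on the coarse graph, then uncoarsen and refine; the sketch above corresponds only to the search step at a single level.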
