Abstract

Cloud computing can provide a more cost-effective way to deploy scientific workflows than traditional distributed computing environments such as clusters and grids. Due to the large size of scientific datasets, data placement plays an important role in scientific cloud workflow systems, improving system performance and reducing data transfer cost. The traditional task-level data placement strategy considers only the datasets shared within individual workflows to reduce data transfer cost. However, a task-level strategy is not necessarily adequate when multiple workflows must be handled together at the workflow level. In this paper, a novel workflow-level data placement model is constructed that regards multiple workflows as a whole. A two-stage data placement strategy is then proposed, which first pre-allocates initial datasets to proper datacenters during the workflow build-time stage, and then dynamically distributes newly generated datasets to appropriate datacenters during the runtime stage. Both stages use an efficient discrete particle swarm optimization (DPSO) algorithm to place flexible-location datasets. Comprehensive experiments demonstrate that our workflow-level data placement strategy can be more cost-effective than its task-level counterpart for data-sharing scientific cloud workflows.
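
To make the general idea concrete, the sketch below shows how a discrete particle swarm optimization could encode dataset placements as integer vectors (one datacenter index per flexible-location dataset) and search for a placement that minimizes total data transfer. The cost model (each task runs at the datacenter holding most of its input data), the update rule, and all parameter names are illustrative assumptions for exposition, not the paper's actual formulation.

```python
# Minimal DPSO sketch for dataset placement. All details here are
# illustrative assumptions, not the paper's actual model or algorithm.
import random

def transfer_cost(placement, tasks, sizes):
    """Total data moved: assume each task runs at the datacenter holding
    most of its input data; every other input dataset is transferred there."""
    total = 0.0
    for datasets in tasks:
        # How much of this task's input each datacenter already holds.
        local = {}
        for d in datasets:
            local[placement[d]] = local.get(placement[d], 0.0) + sizes[d]
        best_dc = max(local, key=local.get)
        total += sum(sizes[d] for d in datasets if placement[d] != best_dc)
    return total

def dpso_place(tasks, sizes, n_dcs, n_particles=30, iters=200, seed=0):
    rng = random.Random(seed)
    n = len(sizes)
    # Each particle is a candidate placement: dataset index -> datacenter index.
    swarm = [[rng.randrange(n_dcs) for _ in range(n)] for _ in range(n_particles)]
    pbest = [list(p) for p in swarm]
    pcost = [transfer_cost(p, tasks, sizes) for p in swarm]
    g = min(range(n_particles), key=lambda i: pcost[i])
    gbest, gcost = list(pbest[g]), pcost[g]
    for _ in range(iters):
        for i, p in enumerate(swarm):
            for j in range(n):
                r = rng.random()
                if r < 0.5:           # move toward personal best
                    p[j] = pbest[i][j]
                elif r < 0.9:         # move toward global best
                    p[j] = gbest[j]
                else:                 # random exploration (mutation)
                    p[j] = rng.randrange(n_dcs)
            c = transfer_cost(p, tasks, sizes)
            if c < pcost[i]:
                pbest[i], pcost[i] = list(p), c
                if c < gcost:
                    gbest, gcost = list(p), c
    return gbest, gcost

# Example: 5 datasets (sizes in GB) shared by 3 tasks, across 2 datacenters.
sizes = [10.0, 4.0, 7.0, 2.0, 5.0]
tasks = [[0, 1], [1, 2, 3], [0, 3, 4]]
placement, cost = dpso_place(tasks, sizes, n_dcs=2)
print("placement:", placement, "transfer cost:", cost)
```

Under the paper's two-stage strategy, such a search would run once at build time over the initial datasets and again at runtime as newly generated datasets appear; the sketch above corresponds to a single such optimization pass.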
