Abstract

Effectively analyzing geo-distributed datasets is an emerging demand in cloud-edge systems. Previous research mainly focuses on offloading suitable data-analytic tasks from hot or weak edges to the datacenter (DC) to minimize the response time of the current job. Since some datasets are accessed multiple times, we argue that re-distributing data along with task offloading benefits forthcoming jobs and improves overall performance, even though it may increase the completion time of the current job. To minimize the overall completion time of a sequence of jobs while guaranteeing the current job's response time and WAN usage, we formulate the ε-bounded geo-distributed data-driven task scheduling problem, taking heterogeneity into account. We then design runData, an online data-driven task scheduling scheme that offloads suitable tasks, together with their related data via piggybacking, to a DC based on delicately calculated probabilities. Through rigorous theoretical analysis, we prove that runData concentrates around its optimum with high probability. We implement runData on Spark. Both testbed and simulation results show that runData re-distributes appropriate data via piggybacking and achieves up to a 37% improvement in average response time compared with state-of-the-art task scheduling schemes.
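As a rough illustration of the probabilistic, data-driven offloading decision described above, the following Python sketch combines edge and DC load, how often a dataset is reused, a response-time bound, and a WAN budget into an offload-or-stay choice. All names, weights, and constraint forms here (offload_probability, schedule_task, the epsilon_bound parameter, and so on) are hypothetical assumptions for illustration only; they are not the paper's actual formulation of runData.

```python
import random

def offload_probability(edge_load, dc_load, data_reuse_count, alpha=0.5):
    """Estimate the probability of offloading a task (and piggybacking its
    input data) from a busy edge to the datacenter.

    A hotter edge and a frequently reused dataset both push the probability
    up, since re-distributing such data also benefits forthcoming jobs.
    This scoring rule is an assumption, not the paper's derivation.
    """
    load_gap = max(edge_load - dc_load, 0.0)
    reuse_bonus = 1.0 - 1.0 / (1.0 + data_reuse_count)
    return min(1.0, alpha * load_gap + (1.0 - alpha) * reuse_bonus)

def schedule_task(task, edge, dc, epsilon_bound):
    """Offload the task to the DC with the computed probability, but only if
    the current job's extra latency stays within the epsilon bound and the
    edge's WAN budget is not exceeded."""
    p = offload_probability(edge["load"], dc["load"], task["reuse_count"])
    extra_latency = task["data_size"] / edge["wan_bandwidth"]
    within_bound = extra_latency <= epsilon_bound * task["local_runtime"]
    within_wan = task["data_size"] <= edge["wan_budget"]
    if within_bound and within_wan and random.random() < p:
        # Data is piggybacked to the DC along with the offloaded task.
        edge["wan_budget"] -= task["data_size"]
        return "datacenter"
    return "edge"

if __name__ == "__main__":
    edge = {"load": 0.9, "wan_bandwidth": 100.0, "wan_budget": 500.0}
    dc = {"load": 0.3}
    task = {"data_size": 80.0, "local_runtime": 10.0, "reuse_count": 3}
    print(schedule_task(task, edge, dc, epsilon_bound=1.5))
```

In this sketch, the randomized choice plays the role of the delicately calculated probabilities mentioned in the abstract: tasks over heavily reused data on overloaded edges are more likely to be moved, while the bound and WAN checks protect the current job.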
