Abstract

With the globalization of service, organizations continuously produce large volumes of data that need to be analysed over geo-dispersed locations. Traditionally central approach that moving all data to a single cluster is inefficient or infeasible due to the limitations such as the scarcity of wide-area bandwidth and the low latency requirement of data processing. Processing big data across geo-distributed datacenters continues to gain popularity in recent years. However, managing distributed MapReduce computations across geo-distributed datacenters poses a number of technical challenges: how to allocate data among a selection of geo-distributed datacenters to reduce the communication cost, how to determine the Virtual Machine (VM) provisioning strategy that offers high performance and low cost, and what criteria should be used to select a datacenter as the final reducer for big data analytics jobs. In this paper, these challenges is addressed by balancing bandwidth cost, storage cost, computing cost, migration cost, and latency cost, between the two MapReduce phases across datacenters. We formulate this complex cost optimization problem for data movement, resource provisioning and reducer selection into a joint stochastic integer nonlinear optimization problem by minimizing the five cost factors simultaneously. The Lyapunov framework is integrated into our study and an efficient online algorithm that is able to minimize the long-term time-averaged operation cost is further designed. Theoretical analysis shows that our online algorithm can provide a near optimum solution with a provable gap and can guarantee that the data processing can be completed within pre-defined bounded delays. Experiments on WorldCup98 web site trace validate the theoretical analysis results and demonstrate that our approach is close to the offline-optimum performance and superior to some representative approaches.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call