Abstract

In the era of global-scale services, analytical queries are performed on datasets that span multiple data centers (DCs). Such geo-distributed queries generate a large volume of inter-DC data transfers at run time. Because inter-DC bandwidth is expensive, various methods have been proposed to reduce the traffic cost of geo-distributed data analytics. However, existing methods do not address throughput. In this article, we characterize and optimize the cost-throughput tradeoff in geo-distributed data analytics. Our objectives are twofold: (1) to minimize the inter-DC traffic cost of serving geo-distributed analytics under uncertain query demand, and (2) to maximize system throughput, measured as the number of query requests that can be served with a guaranteed queuing delay. Specifically, we formulate a stochastic optimization problem that combines these two objectives. To solve it, we use Lyapunov optimization techniques to design and analyze a two-timescale online control framework. Without prior knowledge of future query requests, the framework makes online decisions on input data placement and admission control of query requests. Rigorous theoretical analysis shows that the framework achieves a near-optimal solution while maintaining system stability and robustness. Extensive trace-driven simulations further demonstrate that the framework reduces inter-DC traffic cost, improves system throughput, and guarantees a maximum delay for each query request.

