Abstract

In the era of global-scale services, big data analytical queries are often required to process datasets that span multiple data centers (DCs). In this setting, cross-DC bandwidth is often the scarcest, most volatile, and/or most expensive resource. However, current widely deployed big data analytics frameworks make no attempt to minimize the traffic traversing these links. In this paper, we present Pixida, a scheduler that aims to minimize data movement across resource-constrained links. To achieve this, we introduce a new abstraction called Silo, which is key to modeling Pixida's scheduling goals as a graph partitioning problem. Furthermore, we show that existing graph partitioning problem formulations do not map to how big data jobs work, causing their solutions to miss opportunities for avoiding data movement. To address this, we formulate a new graph partitioning problem and propose a novel algorithm to solve it. We integrated Pixida in Spark, and our experiments show that, when compared to existing schedulers, Pixida achieves a significant traffic reduction of up to ~9x on the aforementioned links.

