Abstract

How should we split data among the nodes of a distributed data warehouse in order to boost performance for a forecasted workload? In this paper, we study the effect of different data partitioning schemes on the overall network cost of pairwise joins. We describe a generally applicable data distribution framework initially designed for Amazon Redshift, a fully managed petabyte-scale data warehouse in the cloud. To formalize the problem, we first introduce the Join Multi-Graph, a concise graph-theoretic representation of the workload history of a cluster. We then formulate the "Distribution-Key Recommendation" problem, a novel combinatorial problem on the Join Multi-Graph, and relate it to problems studied in other subfields of computer science. Our theoretical analysis proves that "Distribution-Key Recommendation" is NP-complete and hard to approximate efficiently. We therefore propose BaW, a hybrid approach that combines heuristic and exact algorithms to find a good data distribution scheme. Our extensive experimental evaluation on real and synthetic data demonstrates the efficacy of our method in recommending optimal (or close-to-optimal) distribution keys, which improve cluster performance by reducing network cost by up to 32x on some real workloads.
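To make the abstract's central object concrete, here is a minimal, hypothetical sketch of how a join multi-graph could be materialized from a workload log: tables are vertices, and each parallel edge records a join-column pair together with its observed frequency. The class name, methods, and the simple collocation score below are illustrative assumptions for exposition, not the paper's actual definitions or the BaW algorithm.

```python
from collections import defaultdict

class JoinMultiGraph:
    """Illustrative sketch: vertices are tables; each edge between two tables
    records how often they were joined on a particular pair of columns.
    (Hypothetical structure; the paper's formal definition may differ.)"""

    def __init__(self):
        # (table_a, table_b) -> {(col_a, col_b): join_frequency}
        self.edges = defaultdict(lambda: defaultdict(int))

    def record_join(self, table_a, col_a, table_b, col_b, count=1):
        # Canonicalize endpoint order so (A, B) and (B, A) map to the same edge.
        if (table_b, col_b) < (table_a, col_a):
            table_a, col_a, table_b, col_b = table_b, col_b, table_a, col_a
        self.edges[(table_a, table_b)][(col_a, col_b)] += count

    def collocated_join_frequency(self, dist_key):
        # Sum the frequencies of joins whose join columns match the chosen
        # distribution keys, i.e. joins that would need no network redistribution.
        total = 0
        for (ta, tb), multi_edges in self.edges.items():
            for (ca, cb), freq in multi_edges.items():
                if dist_key.get(ta) == ca and dist_key.get(tb) == cb:
                    total += freq
        return total


# Usage: two joins of "orders" with "customers" on customer_id, and one join
# of "orders" with "lineitem" on order_id.
g = JoinMultiGraph()
g.record_join("orders", "customer_id", "customers", "customer_id", count=2)
g.record_join("orders", "order_id", "lineitem", "order_id", count=1)
print(g.collocated_join_frequency({"orders": "customer_id",
                                   "customers": "customer_id",
                                   "lineitem": "order_id"}))  # -> 2
```

Under this toy scoring, choosing customer_id as the distribution key for "orders" collocates the two frequent joins with "customers" but forces the single join with "lineitem" to be redistributed, which is exactly the kind of trade-off the Distribution-Key Recommendation problem optimizes over.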
