Abstract

Modern data management systems extensively use parallelism to speed up query processing over massive volumes of data. This trend has inspired a rich line of research on how to formally reason about the parallel complexity of join computation. In this paper, we go beyond joins and study the parallel evaluation of recursive queries. We introduce a novel framework to reason about multi-round evaluation of Datalog programs, which combines implicit predicate restriction with distribution policies to allow expressing a combination of data-parallel and query-parallel evaluation strategies. Using our framework, we reason about key properties of distributed Datalog evaluation, including parallel-correctness of the evaluation strategy, disjointness of the computation effort, and bounds on the number of communication rounds.

Highlights

  • Modern data management systems – such as Spark [27, 33], Hadoop [16, 11], and others [17] – have extensively used parallelism to speed up query processing over massive volumes of data

  • We show that an economic policy can capture several algorithms used for parallel evaluation of recursive and non-recursive queries, including the Hypercube algorithm [13, 4], and the decomposable strategies based on program restrictions [30]

  • To overcome the undecidability of parallel-correctness, we identify a general family of economic policies, called Generalized Hypercube Policies (GHPs), which are always parallel-correct, and further capture several commonly used parallel evaluation strategies

Summary

Introduction

Modern data management systems – such as Spark [27, 33], Hadoop [16, 11], and others [17] – have extensively used parallelism to speed up query processing over massive volumes of data. To reason about Hypercube-like algorithms, Ameloot et al. [6] recently introduced a framework that captures one-round evaluation of joins under different data distributions. Their framework implicitly describes a single-round parallel algorithm through a distribution policy, which specifies how the facts in the input relations are distributed among the machines. We show that an economic policy can capture several algorithms used for parallel evaluation of recursive and non-recursive queries, including the Hypercube algorithm [13, 4] and the decomposable strategies based on program restrictions [30]. In this framework, we study several properties of economic policies. We ask which Datalog programs admit economic policies that are bounded by one round: we show that such programs are characterized by a syntactic property called pivoting, which was identified by Wolfson and Silberschatz [32] in the context of decomposable programs.
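
To make the notion of a distribution policy concrete, the following is a minimal sketch of a Hypercube-style one-round evaluation of a join. It is illustrative only and is not the paper's economic-policy or GHP formalism. Everything in it is an assumption made for the example: the triangle query Q(x, y, z) :- R(x, y), S(y, z), T(z, x), the 2 x 2 x 2 server grid in `SHARES`, the toy hash function `h`, and the sample database `db`.

```python
from itertools import product

# Hypothetical setup: triangle query Q(x, y, z) :- R(x, y), S(y, z), T(z, x)
# evaluated on a 2 x 2 x 2 grid of servers, one grid dimension per variable.
VARS = ("x", "y", "z")
SHARES = {"x": 2, "y": 2, "z": 2}
ATOMS = {"R": ("x", "y"), "S": ("y", "z"), "T": ("z", "x")}


def h(var, value):
    """Toy hash function: bucket of `value` along the dimension of `var`."""
    return value % SHARES[var]


def destinations(rel, fact):
    """Distribution policy: the set of servers that receive `fact` of `rel`."""
    bound = {var: h(var, val) for var, val in zip(ATOMS[rel], fact)}
    # Coordinates of the atom's variables are fixed by hashing;
    # the fact is replicated along every remaining dimension.
    axes = [[bound[v]] if v in bound else range(SHARES[v]) for v in VARS]
    return set(product(*axes))


def triangles(db):
    """Centralized evaluation of the triangle query over database `db`."""
    return {
        (x, y, z)
        for (x, y) in db["R"]
        for (y2, z) in db["S"] if y2 == y
        for (z2, x2) in db["T"] if z2 == z and x2 == x
    }


def hypercube_eval(db):
    """One round: distribute facts by the policy, join locally, union results."""
    grid = product(*(range(SHARES[v]) for v in VARS))
    servers = {s: {"R": set(), "S": set(), "T": set()} for s in grid}
    for rel, facts in db.items():
        for fact in facts:
            for s in destinations(rel, fact):
                servers[s][rel].add(fact)
    result = set()
    for local_db in servers.values():
        result |= triangles(local_db)
    return result


# Sample database (made up for illustration).
db = {
    "R": {(1, 2), (2, 3), (4, 5)},
    "S": {(2, 3), (3, 1), (5, 6)},
    "T": {(3, 1), (1, 2), (6, 7)},
}
```

On this sample, `hypercube_eval(db)` and `triangles(db)` return the same set of triangles. That agreement between the distributed and the centralized result is what parallel-correctness asks for, here checked on a single instance rather than proved for all instances. Each fact is replicated only along the dimension of the variable it does not mention, so `destinations("R", (1, 2))` contains two of the eight servers.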

Parallel Complexity
Decomposability
Other Parallel Schemes
Systems
Preliminaries
Datalog
Evaluation Semantics
Proof Theoretic Concepts
The Framework
Datalog Evaluation Modulo Policies
Distributed Evaluation Strategy
Parallel-Correctness
Generalized Hypercube Policies
Weakly Pivoting GHPs
Weakly Pivoting Datalog
Bounded and Disjoint Evaluation
Conclusion
A Appendix