A scalable and generic approach to range joins

Maximilian Reif,Thomas Neumann

doi:10.14778/3551793.3551849

Abstract

Analytical database systems provide great insights into large datasets and are an excellent tool for data exploration and analysis. A central pillar of query processing is the efficient evaluation of equi-joins, typically with linear-time algorithms (e.g. hash joins). However, for many use-cases with location and temporal data, non-equi joins, like range joins, occur in queries. Without optimizations, this typically results in nested loop evaluation with quadratic complexity. This leads to unacceptable query execution times. Different mitigations have been proposed in the past, like partitioning or sorting the data. While these allow for handling certain classes of queries, they tend to be restricted in the kind of queries they can support. And, perhaps even more importantly, they do not play nice with additional equality predicates that typically occur within a query and that have to be considered, too. In this work, we present a kd-tree-based, multi-dimension range join that supports a very wide range of queries, and that can exploit additional equality constraints. This approach allows us to handle large classes of queries very efficiently, with negligible memory overhead, and it is suitable as a general-purpose solution for range queries in database systems. The join algorithm is fully parallel, both during the build and the probe phase, and scales to large problem instances and high core counts. We demonstrate the feasibility of this approach by integrating it into our database system Umbra and performing extensive experiments with both large real world data sets and with synthetic benchmarks used for sensitivity analysis. In our experiments, it outperforms hand-tuned Spark code and all other database systems that we have tested.

Full Text