ATrie Group Join: A Parallel Star Group Join and Aggregation for In-Memory Column-Stores

Prajwol Sangat, Christopher Messom, David Taniar

doi:10.1109/tbdata.2020.3004520

Abstract

This article presents a new holistic and efficient approach to big data analysis. We introduce a new parallel algorithm, known as ATrie Group Join (ATGJ), that integrates join, grouping and aggregation operations to accelerate big data analytical workloads in in-memory column-stores. ATGJ performs a single scan of the fact columns and uses a mixture of data and task parallelism to make optimal use of computing resources. It uses a novel technique to perform group-by and aggregation, realising the grouping attributes as a tree-shaped deterministic finite automaton known as the Aggregate Trie or ATrie. The ATrie facilitates grouping of attributes and processing of data in tight loops, which significantly improves performance on modern hardware. Unlike competing algorithms, the use of the ATrie avoids creating multiple data structures as the number of dimension tables and grouping attributes increases. We also demonstrate that ATGJ performs efficiently even when the ATrie becomes bushy. We evaluated the algorithm using the Star Schema Benchmark (SSBM) and show that it is significantly faster and scales better than other algorithms with respect to the number of concurrent threads, the number of group-by attributes, the data set size and the query complexity.
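To make the idea of grouping through a trie concrete, the following is a minimal single-threaded sketch, not the authors' implementation. It assumes that each dimension table has already been filtered and reduced to a dictionary from dimension key to grouping attribute value, and that the aggregate is a sum. The names `ATrieNode`, `group_join_aggregate` and `flatten` are hypothetical and chosen only for illustration; the data and task parallelism described in the abstract is omitted.

```python
class ATrieNode:
    """Hypothetical trie node: one trie level per grouping attribute."""
    __slots__ = ("children", "total")

    def __init__(self):
        self.children = {}  # grouping attribute value -> child node
        self.total = 0      # running aggregate, used at the deepest level


def group_join_aggregate(fact_keys, measures, dim_lookups):
    """Join, group and sum in a single scan of the fact columns.

    fact_keys   : list of foreign-key columns, one per dimension
    measures    : the measure column to be summed
    dim_lookups : list of dicts (dimension key -> grouping value), holding
                  only the dimension rows that pass the query filters
                  (an assumption of this sketch)
    """
    root = ATrieNode()
    for row, measure in enumerate(measures):
        # Resolve every grouping value first; skip the fact row as soon
        # as one dimension key has no qualifying match.
        path = []
        for fk_column, lookup in zip(fact_keys, dim_lookups):
            group_value = lookup.get(fk_column[row])
            if group_value is None:
                break
            path.append(group_value)
        else:
            # Walk (and extend) the trie along the grouping values.
            node = root
            for group_value in path:
                child = node.children.get(group_value)
                if child is None:
                    child = node.children[group_value] = ATrieNode()
                node = child
            node.total += measure
    return root


def flatten(node, depth, prefix=()):
    """Yield (group tuple, aggregate) pairs for nodes at the given depth."""
    if depth == 0:
        yield prefix, node.total
        return
    for value, child in node.children.items():
        yield from flatten(child, depth - 1, prefix + (value,))


# Toy fact columns and pre-filtered dimension lookups (illustrative data).
custkey = [1, 2, 1, 3]
datekey = [10, 10, 20, 10]
revenue = [100, 50, 25, 70]
customer_region = {1: "ASIA", 2: "EUROPE"}  # customer 3 is filtered out
date_year = {10: 1997, 20: 1998}

trie = group_join_aggregate([custkey, datekey], revenue,
                            [customer_region, date_year])
groups = dict(flatten(trie, depth=2))
```

In this sketch a single trie holds all groups regardless of how many grouping attributes are used: adding another attribute adds one more level rather than another data structure.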

Full Text