Abstract

Indices improve the performance of relational databases, especially on queries that return a small portion of the data (i.e., low-selectivity queries). Star Joins are particularly expensive operations that commonly rely on indices for improved performance at scale. However, the development and support of index-based solutions for Star Joins are still at a very early stage. To address this gap, we propose a distributed Bitmap Join Index (dBJI) and a framework-agnostic strategy to solve join predicates in linear time. For empirical analysis, we used common Hadoop technologies (e.g., HBase and Spark) to show that dBJI outperforms full-scan approaches by 59% to 88% on low-selectivity queries from the Star Schema Benchmark (SSB). Thus, distributed indices may significantly enhance low-selectivity query performance even in very large databases.

Highlights

  • The volume of available data has changed the design and value of decision-making systems across a broad range of fields [1, 2, 3]

  • We propose a strategy that combines distributed indices and a two-layer architecture based on open-source frameworks to accelerate Star Join queries with low selectivity

  • By employing an Access Layer able to perform random access, we propose a distributed Bitmap Join Index that leverages the parallelism provided by the Processing Layer to solve Star Joins (Section 4.2)


Summary

Introduction

The volume of available data has changed the design and value of decision-making systems across a broad range of fields [1, 2, 3]. The Bitmap Join Index is composed of bitmap arrays that represent the occurrence of attribute values from dimension tables in the tuples of the fact table [20]. A Bitmap Join Index for an attribute α from the dimension table D is a set of bitmap arrays, one for every distinct value of α. For every value x of the attribute α, the bitmap itα=x contains one bit for each tuple of the fact table, indexed by its primary key pkf. Each of these bits represents the occurrence (1) or not (0) of the value x in the corresponding tuple of the fact table. For example, if only bits 2 and 9 remain set after the query predicates are evaluated on the index, only tuples 2 and 9 from the fact table should be retrieved via random access.
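The bitmap definition above can be illustrated with a minimal single-machine sketch. It is not the distributed dBJI implementation: the tables, column names (custkey, suppkey, region, nation) and helper functions are hypothetical, and the data is chosen so that the predicate selects tuples 2 and 9. Each bitmap is stored as a Python integer whose bit at position pkf is 1 when the fact tuple joins a dimension row holding the indexed value.

```python
from typing import Dict, List

# Hypothetical dimension tables: primary key -> attribute values.
customer = {1: {"region": "ASIA"}, 2: {"region": "EUROPE"}}
supplier = {1: {"nation": "JAPAN"}, 2: {"nation": "CHINA"}}

# Hypothetical fact table: primary key (pk_f) -> foreign keys and a measure.
fact = {
    1: {"custkey": 1, "suppkey": 2, "revenue": 10},
    2: {"custkey": 1, "suppkey": 1, "revenue": 20},
    3: {"custkey": 2, "suppkey": 1, "revenue": 30},
    4: {"custkey": 2, "suppkey": 2, "revenue": 40},
    5: {"custkey": 1, "suppkey": 2, "revenue": 50},
    6: {"custkey": 2, "suppkey": 1, "revenue": 60},
    7: {"custkey": 2, "suppkey": 2, "revenue": 70},
    8: {"custkey": 1, "suppkey": 2, "revenue": 80},
    9: {"custkey": 1, "suppkey": 1, "revenue": 90},
    10: {"custkey": 2, "suppkey": 1, "revenue": 100},
}


def build_bitmap_join_index(
    fact_table: Dict[int, Dict[str, int]],
    dimension: Dict[int, Dict[str, str]],
    fk_column: str,
    attribute: str,
) -> Dict[str, int]:
    """Return one bitmap per distinct attribute value.

    Bit pk_f of a bitmap is 1 when fact tuple pk_f references a
    dimension row whose attribute equals that value, and 0 otherwise.
    """
    index: Dict[str, int] = {}
    for pk_f, row in fact_table.items():
        value = dimension[row[fk_column]][attribute]
        index[value] = index.get(value, 0) | (1 << pk_f)
    return index


def set_bits(bitmap: int) -> List[int]:
    """List the positions (fact primary keys) whose bit is 1."""
    return [pk for pk in range(bitmap.bit_length()) if (bitmap >> pk) & 1]


region_index = build_bitmap_join_index(fact, customer, "custkey", "region")
nation_index = build_bitmap_join_index(fact, supplier, "suppkey", "nation")

# Conjunctive predicate: region = 'ASIA' AND nation = 'JAPAN'.
result_bitmap = region_index["ASIA"] & nation_index["JAPAN"]
selected_keys = set_bits(result_bitmap)

# Random access: fetch only the selected fact tuples instead of scanning all.
selected_rows = [fact[pk] for pk in selected_keys]
print(selected_keys)  # [2, 9]
```

In this sketch the join predicates are resolved with a bitwise AND over the bitmaps, so the fact table is touched only for the two selected keys rather than scanned in full.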

