A New Big Data Benchmark for OLAP Cube Design Using Data Pre-Aggregation Techniques

Roberto Tardío, Alejandro Maté, Juan Trujillo

doi:10.3390/app10238674

Abstract

In recent years, several new technologies have enabled OLAP processing over Big Data sources. Among these technologies, we highlight those that allow data pre-aggregation because of their demonstrated query performance. One example is Apache Kylin, a Hadoop-based technology that supports sub-second queries over fact tables with billions of rows combined with ultra-high-cardinality dimensions. However, taking advantage of data pre-aggregation techniques to design analytic models for Big Data OLAP is not a trivial task. It requires very advanced knowledge of the underlying technologies and of user querying patterns. A poorly designed OLAP cube significantly affects several key performance metrics, including (i) the analytic capabilities of the cube (the time and ability to answer a query), (ii) the size of the OLAP cube, and (iii) the time required to build the OLAP cube. Therefore, in this paper we (i) propose a benchmark to help Big Data OLAP designers choose the most suitable cube design for their goals, (ii) identify and describe the main requirements and trade-offs for effectively designing a Big Data OLAP cube that takes advantage of data pre-aggregation techniques, and (iii) validate our benchmark in a case study.
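To make the trade-off between query time, cube size, and build effort concrete, the following is a minimal sketch of data pre-aggregation in plain Python. It is not Apache Kylin's implementation: the toy fact table, the dimension names, and the helpers `build_cube` and `query` are all hypothetical and chosen only for illustration. The sketch materializes one aggregate table (a cuboid) for every subset of the dimensions, so a cube with n dimensions yields 2^n cuboids.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy fact table: (country, product, year, sales).
FACTS = [
    ("ES", "laptop", 2019, 100),
    ("ES", "phone", 2019, 50),
    ("FR", "laptop", 2019, 70),
    ("FR", "laptop", 2020, 30),
]
DIMENSIONS = ("country", "product", "year")


def build_cube(facts, dimensions):
    """Pre-aggregate SUM(sales) for every subset of dimensions (one cuboid each)."""
    cube = {}
    indexed = list(enumerate(dimensions))
    for size in range(len(dimensions) + 1):
        for subset in combinations(indexed, size):
            names = tuple(name for _, name in subset)
            cuboid = defaultdict(int)
            for row in facts:
                # Group by the selected dimension columns; the measure is the last field.
                key = tuple(row[i] for i, _ in subset)
                cuboid[key] += row[-1]
            cube[names] = dict(cuboid)
    return cube


def query(cube, group_by, key):
    """Answer a query with a single lookup in the matching cuboid.

    group_by must list dimensions in the same order as DIMENSIONS.
    """
    return cube[tuple(group_by)][tuple(key)]


cube = build_cube(FACTS, DIMENSIONS)
num_cuboids = len(cube)                            # 2 ** 3 cuboids
stored_rows = sum(len(c) for c in cube.values())   # total pre-aggregated rows
sales_fr_laptop = query(cube, ("country", "product"), ("FR", "laptop"))
```

In this sketch every query becomes a dictionary lookup, but the build step scans the fact table once per cuboid and the stored rows outnumber the original facts. Adding a dimension doubles the number of cuboids, which illustrates why the choice of what to pre-aggregate drives both cube size and build time.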

Highlights

Nowadays there is a large number of technologies that enable effective processing of Big Data, i.e., huge data volumes that come from a diversity of data sources and are increasingly acquired and processed in real time.
We have proposed a new benchmark for Big Data OLAP, aimed at benchmarking OLAP cubes designed by taking advantage of data pre-aggregation techniques provided by any of the current Big Data OLAP approaches [2,3,4,5,6,7,8].

Summary

Introduction

Nowadays there is a large number of technologies that enable effective processing of Big Data, i.e., huge data volumes (terabytes) that come from a diversity of data sources (relational and non-relational data) and are increasingly acquired and processed in real time. The main BI applications are report generation, dashboarding, and multidimensional views. These applications often require very low query latency, from milliseconds to a few seconds of execution, in order to retrieve results from the analytic model. This low query latency is necessary to support interactive BI applications that promote the discovery of insights by decision makers and data analysts. This kind of interactive BI application is known as On-Line Analytical Processing (OLAP).
