Evaluating partitioning and bucketing strategies for Hive-based Big Data Warehousing systems

Eduarda Costa, Carlos Costa, Maribel Yasmina Santos

doi:10.1186/s40537-019-0196-1

Open Access

Journal: Journal of Big Data
Publication Date: May 6, 2019
Citations: 24
License type: open-access

Affiliation: University of Minho

Abstract

Hive has long been one of the industry-leading systems for Data Warehousing in Big Data contexts, organizing data mainly into databases, tables, partitions and buckets, stored on top of an unstructured distributed file system such as HDFS. Several studies have examined how to optimize the performance of storage systems for Big Data Warehousing. However, few of them explore the impact of data organization strategies on query performance when Hive is the storage technology used to implement Big Data Warehousing systems. Therefore, this paper evaluates the impact of data partitioning and bucketing in Hive-based systems, testing different data organization strategies and verifying their effect on query performance. The results demonstrate the advantages of implementing Big Data Warehouses based on denormalized models and the potential benefit of adequate partitioning strategies. Defining partitions aligned with the attributes that are frequently used in query conditions/filters can significantly improve the system's response time. In the more intensive workload benchmarked in this paper, overall decreases of about 40% in processing time were observed. The same does not hold for bucketing strategies, which show potential benefits only in very specific scenarios, suggesting a more restricted use of this functionality, namely bucketing two tables by their join attribute.
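To illustrate why partitions aligned with frequently filtered attributes can reduce response time, the following is a minimal in-memory sketch of partition pruning. It is a toy model, not Hive itself: the rows, the column names (`order_year`, `region`, `revenue`) and the helper functions are all hypothetical and do not come from the paper's benchmark.

```python
from collections import defaultdict

# Hypothetical rows standing in for a denormalized table (not the paper's data).
ROWS = [
    {"order_year": 1995, "region": "EUROPE", "revenue": 100},
    {"order_year": 1995, "region": "ASIA", "revenue": 250},
    {"order_year": 1996, "region": "EUROPE", "revenue": 300},
    {"order_year": 1997, "region": "ASIA", "revenue": 50},
]


def partition_by(rows, key):
    """Group rows by the partition attribute, one group per distinct value."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    return dict(partitions)


def total_revenue_full_scan(rows, year):
    """Unpartitioned case: every row is read, then filtered."""
    scanned = 0
    total = 0
    for row in rows:
        scanned += 1
        if row["order_year"] == year:
            total += row["revenue"]
    return total, scanned


def total_revenue_pruned(partitions, year):
    """Partitioned case: only the partition matching the filter is read."""
    rows = partitions.get(year, [])
    return sum(row["revenue"] for row in rows), len(rows)


if __name__ == "__main__":
    parts = partition_by(ROWS, "order_year")
    print(total_revenue_full_scan(ROWS, 1995))
    print(total_revenue_pruned(parts, 1995))
```

Both functions return the same total, but the pruned version reads only the rows of the matching partition. The benefit disappears when the filter attribute differs from the partition attribute, which is why the choice of partition key matters.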

Highlights

One of the fundamental reasons for the notoriety of the Big Data phenomenon is the current extent to which information can be generated and made available [11], mainly due to the constant innovation, transformation, globalization and personalization of the services associated with new business models.
Building on [9], which showed the advantages of simple partitioning on the attributes most frequently used in query filters, and considering the work described in [10], this paper extends that previous work and presents the results obtained with: (i) a multiple partitioning strategy; (ii) different bucketing strategies; and (iii) the combination of partitioning and bucketing strategies.
Beyond the results reported in [9] on the advantages of a fully denormalized table over a dimensional model based on a star schema in Hive, this work extends the comparison between these two data modelling techniques by applying different partitioning and bucketing strategies to a denormalized table and to a star schema.
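The case of bucketing two tables by their join attribute can be sketched as follows. This is a simplified in-memory model under stated assumptions: the tables, the column `cust_id`, the bucket count and the use of `key % num_buckets` as a stand-in hash are all hypothetical, and the code does not reproduce Hive's actual join implementation.

```python
def bucketize(rows, key, num_buckets):
    """Assign each row to a bucket using a stand-in hash of the key."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[row[key] % num_buckets].append(row)
    return buckets


def bucket_join(left_buckets, right_buckets, key):
    """Join bucket i of one table only with bucket i of the other.

    Assumes both tables were bucketed on `key` with the same bucket count,
    so matching keys always land in buckets with the same index.
    """
    joined = []
    for left, right in zip(left_buckets, right_buckets):
        # Build a small lookup over the right-hand bucket only.
        index = {}
        for row in right:
            index.setdefault(row[key], []).append(row)
        for row in left:
            for match in index.get(row[key], []):
                joined.append({**row, **match})
    return joined


# Hypothetical star-schema fragment: a fact table and one dimension.
ORDERS = [
    {"order_id": 1, "cust_id": 10, "amount": 5},
    {"order_id": 2, "cust_id": 11, "amount": 7},
    {"order_id": 3, "cust_id": 10, "amount": 9},
    {"order_id": 4, "cust_id": 13, "amount": 1},
]
CUSTOMERS = [
    {"cust_id": 10, "name": "A"},
    {"cust_id": 11, "name": "B"},
    {"cust_id": 12, "name": "C"},
]

if __name__ == "__main__":
    result = bucket_join(
        bucketize(ORDERS, "cust_id", 2),
        bucketize(CUSTOMERS, "cust_id", 2),
        "cust_id",
    )
    for row in result:
        print(row)
```

Each bucket pair is joined independently, so no bucket needs to be compared against the whole of the other table. The sketch only works because both sides share the bucketing key and the bucket count, which is consistent with the narrow scenario in which the paper reports bucketing to be useful.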

Summary

Introduction

One of the fundamental reasons for the notoriety of the Big Data phenomenon is the current extent to which information can be generated and made available [11], mainly due to the constant innovation, transformation, globalization and personalization of the services associated with new business models. Current data types and formats are a major problem because they challenge the fundamentals of Data Warehouse (DW) processing, which cannot be applied to free text, images, videos or sensor data [18]. Given this conceptual, technological and organizational context, the design and implementation of Big Data Warehouses (BDWs) is becoming an important area of study [6, 7, 13, 18, 20]. These repositories differ substantially from traditional DWs, since they must be based on new logical models, more flexible than the relational ones, and on new technologies with higher levels of performance, scalability and fault-tolerance [14, 23].

Similar Papers

Toward Data Warehouse Modeling in the Context of Big Data
Fatimaez-Zahra Dahaoui ... Mohammed Reda Chbihi Louhdi
20 Oct 2020

Simulation of an automotive supply chain using big data
António A.C Vieira ... José A Oliveira
Computers & Industrial Engineering | VOL. 137
31 Aug 2019

Partitioning and Bucketing in Hive-Based Big Data Warehouses
Eduarda Costa ... Carlos Costa
01 Jan 2018

MapReduce research on warehousing of big data
M Pticek ... B Vrdoljak
01 May 2017