Abstract

Hive has long been one of the industry-leading systems for Data Warehousing in Big Data contexts, mainly organizing data into databases, tables, partitions and buckets, stored on top of an unstructured distributed file system like HDFS. Some studies were conducted for understanding the ways of optimizing the performance of several storage systems for Big Data Warehousing. However, few of them explore the impact of data organization strategies on query performance, when using Hive as the storage technology for implementing Big Data Warehousing systems. Therefore, this paper evaluates the impact of data partitioning and bucketing in Hive-based systems, testing different data organization strategies and verifying the efficiency of those strategies in query performance. The obtained results demonstrate the advantages of implementing Big Data Warehouses based on denormalized models and the potential benefit of using adequate partitioning strategies. Defining the partitions aligned with the attributes that are frequently used in the conditions/filters of the queries can significantly increase the efficiency of the system in terms of response time. In the more intensive workload benchmarked in this paper, overall decreases of about 40% in processing time were verified. The same is not verified with the use of bucketing strategies, which shows potential benefits in very specific scenarios, suggesting a more restricted use of this functionality, namely in the context of bucketing two tables by the join attribute of these tables.

Highlights

  • One of the fundamental reasons for the notoriety of the Big Data phenomenon is the current extent to which information can be generated and made available [11], mainly due to the constant innovation, transformation, globalization and personalization of the services associated with new business models

  • After the work of [9], showing the advantages of simple partitioning using the attributes more frequently used in the query filters, and considering the work described in [10], this paper extends that previous work and presents the results obtained with: (i) the use of a multiple partitioning strategy; (ii) the use of different bucketing strategies; and (iii) the combination of partitioning and bucketing strategies

  • Despite the results depicted in [9], regarding the advantages of using a fully denormalized table over a dimensional model based on a star schema in Hive, this work extends the comparison between these two data modelling techniques by applying different partitioning and bucketing strategies to a denormalized table and to a star schema

Read more

Summary

Introduction

One of the fundamental reasons for the notoriety of the Big Data phenomenon is the current extent to which information can be generated and made available [11], mainly due to the constant innovation, transformation, globalization and personalization of the services associated with new business models. Current data types and formats are a major problem, since they challenge the fundamentals of DW processing, as these cannot be applied to free text, images, videos or sensor data [18]. Due to this current conceptual, technological and organizational context, the design and implementation of Big Data Warehouses (BDWs) is becoming an important area of study [6, 7, 13, 18, 20]. These repositories substantially differ from traditional DWs, since they must be based on new logical models, more flexible than the relational ones, and new technologies with higher levels of performance, scalability and fault-tolerance [14, 23]

Objectives
Results
Discussion
Conclusion
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call