Abstract

Immense quantities of data can no longer be managed by traditional database management systems; instead, they must be handled by big data solutions built on shared-nothing architectures. Data warehouse systems address very large amounts of information, and the most prominent data warehouse model is the star schema, which consists of a fact table and some number of dimension tables. Executing a query on the data warehouse requires joining the facts with their dimensions. In a shared-nothing architecture, not all of the required information resides on a single node, so data must be retrieved from other nodes, which causes network congestion and slow query execution. To avoid this problem and achieve maximum parallelism, the dimensions can be replicated across nodes if they are not too large. However, if a single dimension exceeds the capacity of a node, or the combined volume of the dimensions exceeds node capacity, query execution faces serious problems. Because the volume of data in big data settings is immense, replicating it is not a viable approach. In this paper, we propose a method called Chabok, which uses two-phased MapReduce to solve the data warehousing problem. In this method, aggregation is performed completely on the Mappers, and only intermediate results are sent to the Reducer. Chabok needs no data replication to omit joins. The proposed method was implemented on Hadoop, and TPC-DS queries were executed for benchmarking. Chabok's query execution times outperformed those of prominent big data products for data warehousing.
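
To make the mapper-side aggregation concrete, the sketch below illustrates the in-mapper-combining pattern the abstract describes, written against Hadoop's Java MapReduce API: each mapper fully aggregates its local partition of the fact table and emits only one partial result per group key in cleanup(), so the reducer merely merges a handful of partial sums. The fact-row layout and the column names (store_id, revenue) are hypothetical illustrations, not taken from the paper; this is a minimal sketch of the general pattern, not Chabok's actual implementation.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MapperSideAggregation {

    // Phase 1: each mapper aggregates its local fact partition completely,
    // so only one partial sum per group key crosses the network.
    public static class SumMapper
            extends Mapper<LongWritable, Text, Text, DoubleWritable> {

        private final Map<String, Double> partialSums = new HashMap<>();

        @Override
        protected void map(LongWritable offset, Text line, Context context) {
            // Hypothetical fact-row layout: store_id,revenue
            String[] fields = line.toString().split(",");
            String storeId = fields[0];
            double revenue = Double.parseDouble(fields[1]);
            partialSums.merge(storeId, revenue, Double::sum);
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            // Emit one partial aggregate per key after the whole split is read.
            for (Map.Entry<String, Double> e : partialSums.entrySet()) {
                context.write(new Text(e.getKey()),
                              new DoubleWritable(e.getValue()));
            }
        }
    }

    // Phase 2: the reducer only merges a few partial sums per key.
    public static class SumReducer
            extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {

        @Override
        protected void reduce(Text storeId, Iterable<DoubleWritable> partials,
                              Context context)
                throws IOException, InterruptedException {
            double total = 0.0;
            for (DoubleWritable p : partials) {
                total += p.get();
            }
            context.write(storeId, new DoubleWritable(total));
        }
    }
}
```

Emitting partial aggregates from cleanup() rather than one record per input row is what keeps the shuffle small: the data crossing the network is bounded by the number of distinct group keys per mapper, not by the size of the fact table.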


Introduction

Existing information is a valuable asset for many different types of organizations. Storing and analysing information can solve many problems within an organization [1]. The results of data analyses help organizations make correct decisions and provide better services for customers. Many organizations turn to big data solutions because they cannot manage their data with traditional database management systems [5]; they must take drastic measures to design and implement new systems according to big data architectures. These organizations must change their architectures from …
