Abstract

Immense quantities of data can no longer be managed by traditional database management systems; instead, they must be handled by big data solutions built on shared-nothing architectures. Data warehouse systems address very large amounts of information, and the most prominent data warehouse model is the star schema, which consists of a fact table and some number of dimension tables. Executing a query on the data warehouse requires joining the facts with their dimensions. In a shared-nothing architecture, not all of the required information resides on a single node, so data must be retrieved from other nodes, which causes network congestion and slow query execution. To avoid this problem and achieve maximum parallelism, the dimensions can be replicated across nodes if they are not too large. However, if a single dimension exceeds the capacity of a node, or the combined volume of the dimensions exceeds node capacity, query execution faces serious problems. Because the volume of data in big data settings is immense, replicating it is not a viable approach. In this paper, we propose a method called Chabok, which uses two-phased MapReduce to solve the data warehousing problem. In this method, aggregation is performed completely on the Mappers, and only intermediate results are sent to the Reducer. Chabok needs no data replication to omit joins. The proposed method was implemented on Hadoop, and TPC-DS queries were executed for benchmarking. Chabok's query execution times outperformed those of prominent big data products for data warehousing.
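
To make the mapper-side aggregation concrete, the sketch below illustrates the in-mapper-combining pattern the abstract describes, written against Hadoop's Java MapReduce API: each mapper fully aggregates its local partition of the fact table and emits only one partial result per group key in cleanup(), so the reducer merely merges a handful of partial sums. The fact-row layout and the column names (store_id, revenue) are hypothetical illustrations, not taken from the paper; this is a minimal sketch of the general pattern, not Chabok's actual implementation.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MapperSideAggregation {

    // Phase 1: each mapper aggregates its local fact partition completely,
    // so only one partial sum per group key crosses the network.
    public static class SumMapper
            extends Mapper<LongWritable, Text, Text, DoubleWritable> {

        private final Map<String, Double> partialSums = new HashMap<>();

        @Override
        protected void map(LongWritable offset, Text line, Context context) {
            // Hypothetical fact-row layout: store_id,revenue
            String[] fields = line.toString().split(",");
            String storeId = fields[0];
            double revenue = Double.parseDouble(fields[1]);
            partialSums.merge(storeId, revenue, Double::sum);
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            // Emit one partial aggregate per key after the whole split is read.
            for (Map.Entry<String, Double> e : partialSums.entrySet()) {
                context.write(new Text(e.getKey()),
                              new DoubleWritable(e.getValue()));
            }
        }
    }

    // Phase 2: the reducer only merges a few partial sums per key.
    public static class SumReducer
            extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {

        @Override
        protected void reduce(Text storeId, Iterable<DoubleWritable> partials,
                              Context context)
                throws IOException, InterruptedException {
            double total = 0.0;
            for (DoubleWritable p : partials) {
                total += p.get();
            }
            context.write(storeId, new DoubleWritable(total));
        }
    }
}
```

Emitting partial aggregates from cleanup() rather than one record per input row is what keeps the shuffle small: the data crossing the network is bounded by the number of distinct group keys per mapper, not by the size of the fact table.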


Introduction

Existing information is a valuable asset for many different types of organizations. Storing and analysing information can solve many problems within an organization [1]. The results of data analyses help organizations make correct decisions and provide better services for customers. Many organizations turn to big data solutions because they cannot manage their data with traditional database management systems [5]; they must take drastic measures to design and implement new systems according to big data architectures. These organizations must change their architectures from …
