Abstract

Parallel computing is a fundamental technique for managing large quantities of data, as it leverages the concurrent use of multiple computing resources. One of the technologies that made big data analytics popular and accessible to enterprises of all sizes is MapReduce (and its open-source implementation, Hadoop). By automatically parallelizing applications on a cluster of commodity hardware, MapReduce allows enterprises to analyze terabytes and petabytes of data more conveniently than ever. However, the performance gained from Hadoop's features is currently limited by its default block placement policy, which does not take any data characteristics into account. Indeed, careful data placement can improve the efficiency of many operations, including indexing, grouping, aggregation and joins. In this paper, we present a MapReduce data block allocation approach that improves MapReduce job execution and query performance on multi-node clusters, especially Hadoop clusters. Building on the k-means clustering method, which controls the number of clusters through its k parameter, we study the influence of the number of clusters on query execution, as well as query performance with and without data organization. For this, we used a well-known, large-scale data analysis benchmark: TPC-H. Our experiments suggest that defining a good data placement on the cluster when implementing a data warehouse significantly increases OLAP cube construction and querying performance.
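The abstract does not detail the placement algorithm, but the core idea (group data blocks with k-means so that similar blocks can be co-located on the same cluster node) can be illustrated with a minimal sketch. Everything below is an assumption for illustration only: the per-block feature vectors, the `kmeans` helper, and the node assignment are hypothetical, not the paper's implementation.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: repeatedly assign each point to its nearest
    centroid, then recompute centroids as cluster means."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids from the data
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # index of the closest centroid (squared Euclidean distance)
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # recompute centroids; keep the old one if a cluster went empty
        centroids = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical example: each data block is summarized by one feature,
# e.g. the mean value of a frequently filtered attribute in that block.
blocks = [(1.0,), (1.2,), (0.9,), (9.0,), (9.3,), (8.8,)]
k = 2  # the k parameter the paper varies to study its influence
centroids, clusters = kmeans(blocks, k)

# One cluster of blocks per node: blocks in the same cluster would be
# co-located, so queries touching similar values scan fewer nodes.
for node, cluster in enumerate(clusters):
    print(f"node {node}: {len(cluster)} blocks")
```

Because k is an explicit parameter, rerunning the placement with different values of k is what makes it possible to study how the number of clusters affects query execution, as the abstract describes.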
