Tuning small analytics on Big Data: Data partitioning and secondary indexes in the Hadoop ecosystem

Oscar Romero, Victor Herrero, Alberto Abelló, Jaume Ferrarons

doi:10.1016/j.is.2014.09.005

Open Access

Journal: Information Systems	Publication Date: Sep 21, 2014
Citations: 16	License type: other-oa

Affiliation: Universitat Politècnica de Catalunya

Abstract

In recent years, the problems of using generic (i.e., relational) storage techniques for very specific applications have been detected and outlined and, as a consequence, some alternatives to relational DBMSs (e.g., HBase) have bloomed. Most of these alternatives sit on the cloud and benefit from cloud computing, which is nowadays a reality that helps us save money by eliminating fixed hardware and software costs and paying only per use. On top of this, specific querying frameworks that exploit the brute force of the cloud (e.g., MapReduce) have also been devised. The question that arises next is whether this (rather naive) exploitation of the cloud is an alternative to tuning DBMSs, or whether it still makes sense to consider other options when retrieving data in these settings.

In this paper, we study the feasibility of solving OLAP queries with Hadoop (the Apache project implementing MapReduce) while benefiting from secondary indexes and partitioning in HBase. Our main contribution is the comparison of different access plans and the definition of criteria (i.e., cost estimation) to choose among them in terms of consumed resources (namely CPU, bandwidth and I/O).

Full Text