Massive Data Load on Distributed Database Systems over HBase

Ainhoa Azqueta-Alzuaz, Ricardo Jimenez-Peris, Ivan Brondino, Marta Patino-Martinez

doi:10.1109/ccgrid.2017.124

Publication Date: May 1, 2017

Citations: 15

Abstract

Big Data has become a pervasive technology for managing ever-increasing volumes of data. Among Big Data solutions, scalable data stores play an important role, especially key-value data stores, because they scale to thousands of nodes. The typical workflow of a Big Data application includes two phases. The first loads the data into the data store, typically as part of an ETL (Extract-Transform-Load) process. The second processes the data itself. BigTable and HBase are the preferred key-value solutions based on range-partitioned data stores. However, the loading phase is inefficient and creates a single-node bottleneck. In this paper, we identify and quantify this bottleneck and propose a tool for parallel massive data loading that removes it, making the full parallelism and throughput of the underlying key-value data store available during the loading phase as well. The proposed solution has been implemented as a tool for parallel massive data loading over HBase, the key-value data store of the Hadoop ecosystem.
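To make the single-node bottleneck concrete, the following toy simulation shows one common way it can arise in a range-partitioned store: a table that starts as a single key range sends every row to one node, while a table pre-split into several ranges spreads the same rows across nodes. This is a minimal sketch and not the tool proposed in the paper. The function names, the key format, and the split points are all hypothetical and chosen only for illustration.

```python
from bisect import bisect_right
from collections import Counter


def region_for(key, split_points):
    """Return the index of the key range (region) that contains `key`.

    `split_points` is a sorted list of boundary keys (hypothetical).
    Region i covers [split_points[i-1], split_points[i]), so a key equal
    to a boundary belongs to the region that starts at that boundary.
    """
    return bisect_right(split_points, key)


def load_distribution(keys, split_points):
    """Count how many rows each region would receive during a load."""
    return Counter(region_for(k, split_points) for k in keys)


# Illustrative row keys: row0000 .. row0999 (assumed format).
keys = [f"row{i:04d}" for i in range(1000)]

# No split points: the whole key space is one region on one node.
single = load_distribution(keys, [])

# Three split points: four regions that can live on different nodes.
spread = load_distribution(keys, ["row0250", "row0500", "row0750"])

print("single region:", dict(single))
print("pre-split:    ", dict(sorted(spread.items())))
```

In the unsplit case all 1000 rows land in region 0, which models every write hitting the same node. With the three assumed split points, each of the four regions receives 250 rows, so the load can proceed in parallel.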

Full Text