Abstract

Modern big data frameworks (such as Hadoop and Spark) allow multiple users to perform large-scale analysis simultaneously by deploying data-intensive workflows (DIWs). The DIWs of different users share many common tasks (i.e., 50–80%), which can be materialized and reused in future executions. Materializing the output of such common tasks improves the overall processing time of DIWs and saves computational resources. Current solutions for materialization store data on distributed file systems using a fixed storage format. However, a fixed choice is not optimal in every situation. Specifically, different layouts (i.e., horizontal, vertical, or hybrid) have a large impact on execution time, depending on the access patterns of the subsequent operations. In this paper, we present a cost-based approach that helps decide the most appropriate storage format in each situation. We first present a generic cost-based framework that selects the best format by considering the three main layouts. Then, we use this framework to instantiate cost models for specific Hadoop storage formats (namely SequenceFile, Avro, and Parquet) and test it with two standard benchmark suites. Our solution gives, on average, a 1.33× speedup over fixed SequenceFile, a 1.11× speedup over fixed Avro, and a 1.32× speedup over fixed Parquet; overall, it provides a 1.25× speedup.

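To make the idea of cost-based format selection concrete, the sketch below shows a minimal chooser that picks among SequenceFile, Avro, and Parquet given a description of how downstream operators will access the materialized result. This is an illustrative sketch only: the `AccessPattern` fields, the cost functions, and their coefficients are hypothetical placeholders, not the cost models instantiated in the paper.

```scala
// Minimal sketch of a cost-based storage-format chooser (assumptions, not the paper's models).
object FormatChooser {

  sealed trait Format
  case object SequenceFile extends Format // horizontal (row-oriented) layout
  case object Avro         extends Format // horizontal layout with an explicit schema
  case object Parquet      extends Format // hybrid (columnar) layout

  // Hypothetical summary of how subsequent operators read the materialized intermediate result.
  final case class AccessPattern(
      totalColumns: Int,     // columns stored in the intermediate result
      projectedColumns: Int, // columns actually read by later operators
      scanFraction: Double   // fraction of rows scanned (1.0 = full scan)
  )

  // Assumed per-format cost estimates (lower is better). Real models would be calibrated
  // from measured I/O, (de)serialization, and compression behaviour of each format.
  def estimatedCost(f: Format, p: AccessPattern): Double = f match {
    case SequenceFile => p.totalColumns * p.scanFraction * 1.00
    case Avro         => p.totalColumns * p.scanFraction * 0.85
    case Parquet      => p.projectedColumns * p.scanFraction * 0.60 + 0.20 // column pruning, plus fixed metadata overhead
  }

  // Pick the format with the lowest estimated cost for the given access pattern.
  def choose(p: AccessPattern): Format =
    Seq(SequenceFile, Avro, Parquet).minBy(estimatedCost(_, p))
}

// Example: a wide intermediate result of which later operators read only a few columns
// would typically favour the columnar layout.
// FormatChooser.choose(FormatChooser.AccessPattern(totalColumns = 50, projectedColumns = 3, scanFraction = 1.0))
```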