Efficiently Translating Complex SQL Query to MapReduce Jobflow on Cloud

Zhiang Wu, Aibo Song, Lu Zhang, Junzhou Luo, Jie Cao

doi:10.1109/tcc.2017.2700842

Abstract

MapReduce is a widely used programming model in cloud environments for processing large-scale data sets in parallel. Combining a high-level language with a SQL-to-MapReduce translator allows programmers to write in a SQL-like declarative language, so that each program can afterwards be compiled into a MapReduce jobflow automatically. This approach helps narrow the gap between non-professional users and cloud platforms, and thus significantly improves the usability of the cloud. Although a number of translators have been developed, the auto-generated MapReduce programs still suffer from extreme inefficiency. In this paper, we present an efficient Cost-Aware SQL-to-MapReduce Translator (CAT). CAT has two notable features. First, it defines two intra-SQL correlations: Generalized Job Flow Correlation (GJFC) and Input Correlation (IC), based on which a set of looser merging rules is introduced. Building on these rules, both Top-Down (TD) and Bottom-Up (BU) merging strategies are proposed and integrated into CAT. Second, it adopts a cost estimation model for MapReduce jobflows to guide the selection of the more efficient of the MapReduce jobflows auto-generated by the TD and BU merging strategies. Finally, comparative experiments on the TPC-H benchmark demonstrate the effectiveness and scalability of CAT.

Full Text