Abstract

MapReduce is undoubtedly the most popular framework for large-scale processing and analysis of vast data sets on clusters of machines. To make MapReduce easier to use, SQL-like declarative languages and SQL-to-MapReduce translators have attracted increasing attention recently. A SQL-to-MapReduce translator automatically generates the MapReduce jobflow for each SQL query submitted by users, which significantly simplifies the interface between users and systems. Although a plethora of translators have been developed, the auto-generated MapReduce programs still suffer from extreme inefficiency. In this paper, we attempt to address this challenge by developing a novel Cost-Aware Translator (CAT). CAT has two notable features. First, it defines two intra-SQL correlations, Generalized Job Flow Correlation (GJFC) and Input Correlation (IC), based on which a set of looser merging rules is introduced. Building on these rules, both Top-Down (TD) and Bottom-Up (BU) merging strategies are proposed and integrated into CAT. Second, it adopts a cost estimation model for MapReduce jobflows to guide the selection of the more efficient of the MapReduce jobflows auto-generated by the TD and BU merging strategies. Finally, comparative experiments on the TPC-H benchmark demonstrate the effectiveness and scalability of CAT.

