A Comparison of ORC-Compress Performance with Big Data Workload on Virtualization

Kritwara Rattanaopas, Sureerat Kaewkeerat, Yanapat Chuchuen

doi:10.4028/www.scientific.net/amm.855.153

Open Access

Journal: Applied Mechanics and Materials
Publication Date: Oct 1, 2016
Citations: 9
License type: CC BY 4.0

Affiliation: Songkhla Rajabhat University

Abstract

Big Data is now widely used in many organizations. Hive is an open-source data warehouse system for managing large data sets. It provides a SQL-like interface to Hadoop on top of the Map-Reduce framework. Big Data solutions have begun to adopt HiveQL to improve the execution time of queries over relational data. In this paper, we investigate query execution time for two compression algorithms of the ORC file format: ZLIB and SNAPPY. The results show that ZLIB reduces data size by up to 87% compared with uncompressed (NONE) data, which is better than SNAPPY, whose space saving is 79%. However, the key to reducing execution time is Map-Reduce: query execution time was lower when the number of mappers equaled the number of data nodes. For example, all query suites on 6 nodes (ZLIB/SNAPPY) with a 250-million-row table have execution times quite similar to those on 9 nodes (ZLIB/SNAPPY) with a 350-million-row table.
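The space-saving figures quoted above can be read as the fraction of storage removed relative to the uncompressed (NONE) table. The following is a minimal sketch of that calculation, assuming the common definition of space saving as one minus the compressed-to-uncompressed size ratio; the function name and the byte sizes are hypothetical and were chosen only to reproduce the reported percentages, not taken from the paper's measurements.

```python
def space_saving(uncompressed_bytes: int, compressed_bytes: int) -> float:
    """Return the fraction of storage saved relative to the uncompressed size.

    Assumed definition: 1 - compressed / uncompressed.
    """
    if uncompressed_bytes <= 0:
        raise ValueError("uncompressed_bytes must be positive")
    return 1.0 - compressed_bytes / uncompressed_bytes


# Hypothetical sizes, picked only so the ratios match the reported 87% and 79%.
baseline_none = 1_000
compressed_sizes = {"ZLIB": 130, "SNAPPY": 210}

for codec, size in compressed_sizes.items():
    print(f"{codec}: {space_saving(baseline_none, size):.0%} space saving")
```

Under this definition, an 87% saving means the ZLIB table occupies about 13% of the uncompressed size, while a 79% saving leaves about 21%.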

Highlights

A huge volume of information is called “Big Data”
We investigated the performance of query processing in HiveQL, which provides a SQL-like interface on the Hadoop system
The HiveQL interface can compile all query suites in the workload section into an optimized execution plan of map and reduce jobs, as shown in Table 2, Table 3 and Table 4

Summary

Introduction

A huge volume of information is called “Big Data”. Big Data analytic systems can be broadly categorized into two groups based on database type: SQL for relational databases and NoSQL for non-relational databases. We evaluated the performance of a Hadoop-Hive infrastructure on a virtualization platform with a Big Data solution. The Hive infrastructure is built on the Hadoop Distributed File System (HDFS) and provides database queries for Big Data. Hive processes queries with the Map-Reduce technique, using the Map method for filtering and sorting data. Hive introduces a new compressed database file format called “Optimized Row Columnar (ORC)” [3]. Considering the benefits of Hive and ORC files, we chose them to improve query processing performance on weather station data in a virtualized environment. In Background Review and Related Works, we review the background and related work, including Hadoop, Hive and the Optimized Row Columnar (ORC) file format. The Conclusions section concludes and outlines possible future work.
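To make the ORC compression choice concrete, the sketch below builds a HiveQL `CREATE TABLE` statement that stores a table as ORC and selects the codec through the `orc.compress` table property (NONE, ZLIB or SNAPPY). It is an illustration under stated assumptions rather than the schema used in this study: the helper `orc_table_ddl`, the table name and the weather-station columns are all hypothetical.

```python
from typing import Mapping

# Codecs compared in this paper; Hive selects one via the orc.compress property.
VALID_CODECS = ("NONE", "ZLIB", "SNAPPY")


def orc_table_ddl(table: str, columns: Mapping[str, str], codec: str = "ZLIB") -> str:
    """Build a HiveQL statement for an ORC table with the given compression codec."""
    codec = codec.upper()
    if codec not in VALID_CODECS:
        raise ValueError(f"unsupported codec: {codec}")
    column_list = ", ".join(f"{name} {hive_type}" for name, hive_type in columns.items())
    return (
        f"CREATE TABLE {table} ({column_list}) "
        f'STORED AS ORC TBLPROPERTIES ("orc.compress"="{codec}")'
    )


# Hypothetical weather-station schema, for illustration only.
weather_columns = {
    "station_id": "STRING",
    "observed_at": "TIMESTAMP",
    "temperature": "DOUBLE",
}

for codec in VALID_CODECS:
    print(orc_table_ddl(f"weather_{codec.lower()}", weather_columns, codec))
```

Creating one table per codec from the same source data, as in the loop above, is one way to compare storage size and query time across NONE, ZLIB and SNAPPY.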

Review and Related Work

Methodology

Evaluation

Conclusions