Optimization of Multiple Correlated Queries by Detecting Similar Data Source with Hadoop/Hive

Kavita Bhatnagar, Puneet Kansal, Safdar Tanweer

doi:10.17485/ijst/2017/v10i18/111370

Open Access

https://doi.org/10.17485/ijst/2017/v10i18/111370

Abstract

Objectives: To generate a single new Hive query (HiveQL) from two or more input queries by detecting operations of a similar type and a common data source, and to compare the total execution time of the merged query with that of the original queries. Methods/Statistical Analysis: The paper uses the MapReduce framework underlying Hadoop and Hive. A single new query is generated from two or more input queries, and three sample data sets of 2, 5 and 10 GB are generated with the free database generation tool DBGEN. TPC-H queries are executed on this data, and the total execution times of the original and merged queries are compared to assess performance. Findings: Hive executes a single query at a time; in this research, multiple queries are submitted to Hive by converting them into a single query. This approach reduces the number of operations performed during execution, which in turn reduces execution time and improves the performance of Hive. Hive processes the structured data of a data warehouse system, so this approach allows structured data to be processed and analyzed easily and conveniently. Structured data is used for processing OLAP (Online Analytical Processing) queries, so Hive also helps to process OLAP queries. Hive works in conjunction with Hadoop and executes queries on data stored in Hadoop, so Hadoop must be running on the system before Hive queries can be used. This research requires a large amount of test data, so sample data is generated with DBGEN, the free data generation tool provided by the TPC (Transaction Processing Performance Council). The TPC also provides several types of queries for testing the performance of query execution tools; this research uses the TPC-H queries. Application/Improvements: With the approach presented in this research, the total execution time of Hive queries can be reduced drastically and the performance of Hive can be increased.
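The idea of combining queries that share a data source can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each input query is described by a small dictionary (the keys `source`, `target`, `select`, `where` and `group_by` are hypothetical), groups the queries by source table, and emits one statement per table using Hive's multi-insert form (`FROM ... INSERT OVERWRITE TABLE ... SELECT ...`), so that the shared table is read once. The table and column names follow TPC-H naming; the target tables are invented for illustration.

```python
from collections import defaultdict


def merge_queries(queries):
    """Group queries by source table and build one multi-insert HiveQL
    statement per table. The dictionary keys are hypothetical names
    chosen for this sketch, not part of any Hive API."""
    by_source = defaultdict(list)
    for q in queries:
        by_source[q["source"]].append(q)

    statements = []
    for source, group in by_source.items():
        # The shared FROM clause is written once for the whole group.
        lines = [f"FROM {source}"]
        for q in group:
            lines.append(f"INSERT OVERWRITE TABLE {q['target']}")
            lines.append(f"SELECT {q['select']}")
            if q.get("where"):
                lines.append(f"WHERE {q['where']}")
            if q.get("group_by"):
                lines.append(f"GROUP BY {q['group_by']}")
        statements.append("\n".join(lines) + ";")
    return statements


# Two queries over lineitem and one over orders (target tables are assumed).
queries = [
    {"source": "lineitem", "target": "qty_by_flag",
     "select": "l_returnflag, SUM(l_quantity)", "group_by": "l_returnflag"},
    {"source": "lineitem", "target": "price_by_status",
     "select": "l_linestatus, SUM(l_extendedprice)", "group_by": "l_linestatus"},
    {"source": "orders", "target": "urgent_orders",
     "select": "o_orderkey, o_totalprice", "where": "o_orderpriority = '1-URGENT'"},
]

merged = merge_queries(queries)
for stmt in merged:
    print(stmt)
    print()
```

In this sketch the two `lineitem` queries collapse into one statement with a single `FROM lineitem` clause, while the `orders` query stays on its own because no other query shares its source.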

Full Text