Enhancing the performance of distributed big data processing systems using Hadoop and Polybase

Sergii Minukhin, Victor Fedko, Yurii Gnusov

doi:10.15587/1729-4061.2018.139630

Abstract

An approach to improving the performance of distributed information systems by combining a Hadoop cluster with PolyBase, a component of SQL Server, was considered. The relevance of the problem stems from the need to process Big Data in different representations in order to solve the diverse problems of business projects. Methods and technologies for creating hybrid data warehouses that combine SQL and NoSQL data were analyzed. It was shown that the most common technology for Big Data processing at present is the Hadoop distributed computing environment. Existing technologies for organizing and accessing data in a Hadoop cluster from SQL-like DBMSs by means of connectors were analyzed. Comparative quantitative estimates of using the Hive and Sqoop connectors when exporting data to the Hadoop warehouse were presented. The specific features of Big Data processing in the architecture of Hadoop-based distributed cluster computing were analyzed. The features of PolyBase technology, a SQL Server component that serves as a bridge between SQL Server and Hadoop for SQL and NoSQL data, were presented and described. The composition of a model computing installation, based on a virtual machine, for the joint configuration of PolyBase and Hadoop to solve test tasks was described. A methodological toolset for installing and configuring Hadoop and SQL Server PolyBase software was developed, taking into account constraints on computing capacity. Queries that use PolyBase and the Hadoop data warehouse when processing Big Data were considered. To assess the performance of the system, absolute and relative metrics were proposed.
For a large volume of test data, the results of the experiments were presented and analyzed. They illustrate an increase in the productivity of the distributed information system in terms of query execution time and the amount of memory occupied by the temporary tables created during query execution. A comparative analysis of the studied technology against existing Hadoop cluster connectors showed the advantage of PolyBase over the Sqoop and Hive connectors. The results of the research could be used by organizations in scientific and training experiments when implementing modern IT technologies.
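
The abstract does not define the absolute and relative metrics it mentions. One plausible reading, given here only as an assumption, is that the absolute metric is the difference between a baseline measurement and an improved one, and the relative metric is that difference as a fraction of the baseline. The function names and the sample timings below are hypothetical and are not taken from the paper.

```python
def absolute_gain(baseline: float, improved: float) -> float:
    """Difference between the baseline and the improved measurement.

    Assumed definition; applies equally to query time (seconds)
    or temporary-table size (megabytes).
    """
    return baseline - improved


def relative_gain(baseline: float, improved: float) -> float:
    """Absolute gain expressed as a fraction of the baseline (assumed definition)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


# Hypothetical timings in seconds, chosen only to illustrate the arithmetic.
time_without_pushdown = 120.0
time_with_pushdown = 30.0

print(absolute_gain(time_without_pushdown, time_with_pushdown))  # 90.0
print(relative_gain(time_without_pushdown, time_with_pushdown))  # 0.75
```

The same two functions could be applied to the size of the temporary tables, the second quantity the abstract names.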

Highlights

Modern data processing technologies address the challenges of scaling, flexible use of different tools, data access time, and query processing speed
It seems relevant to develop approaches that improve the performance of an information system working with Big Data of different types by creating a bridge between databases stored in the distributed system Hadoop and the MS SQL Server DBMS
Forcing computation on the Hadoop cluster is effective when only a small part of the entire table ends up in the query result. The reason is that, when computation on Hadoop is disabled, SQL Server first copies all the data into temporary tables and only then performs filtering
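
The last highlight can be illustrated with a small simulation. This is a sketch of the general idea only, not PolyBase itself: plain Python lists stand in for the Hadoop-side table and for SQL Server's temporary table, and the function names, row layout, and row counts are all hypothetical.

```python
def run_without_pushdown(hadoop_rows, predicate):
    """Copy every row into a temporary table first, then filter it."""
    temp_table = list(hadoop_rows)  # full copy to the SQL Server side
    result = [row for row in temp_table if predicate(row)]
    return result, len(temp_table)  # rows that had to be copied


def run_with_pushdown(hadoop_rows, predicate):
    """Filter on the cluster side and copy only the matching rows."""
    temp_table = [row for row in hadoop_rows if predicate(row)]
    return temp_table, len(temp_table)  # rows that had to be copied


# Hypothetical table of 10,000 rows; the predicate keeps 1% of them.
rows = [{"id": i, "amount": i % 100} for i in range(10_000)]


def selective(row):
    return row["amount"] == 0


res_a, copied_a = run_without_pushdown(rows, selective)
res_b, copied_b = run_with_pushdown(rows, selective)

print(copied_a, copied_b)  # 10000 100
print(res_a == res_b)      # True: same answer, far fewer rows copied
```

With a predicate that keeps most of the table, the two copy counts converge, which matches the highlight's condition that pushdown pays off when only a small share of the table reaches the result.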

Summary

Introduction

IDS typically use common parallel databases that support complex processing, updating, and transactional execution of SQL queries. In this case, many companies store data in distributed form so that it can be processed in different ways, such as data mining and real-time SQL queries. It seems relevant to develop approaches that improve the performance of an information system working with Big Data of different types by creating a bridge between databases stored in the distributed system Hadoop and the MS SQL Server DBMS. Using the distributed file system (HDFS) of a Hadoop cluster on the one hand, and an SQL-type database on the other, makes it possible to create hybrid warehouses. This approach makes it possible to manage large amounts of diverse information, both operational and analytical data, on one platform through distributed storage and processing. This, in turn, requires high-quality methodological support for all stages of installing and configuring the software and hardware platform of the system being developed and studied.

Literature review and problem statement

Technology of Big Data processing in the distributed system of Hadoop

The aim and objectives of the study

Analysis of performance of SQL and Hadoop connectors

Technology of connector PolyBase SQL Server – Hadoop

Development of methodological support for deployment of Hadoop-PolyBase

Results of queries execution

Conclusions


Journal: Eastern-European Journal of Enterprise Technologies	Publication Date: Jul 27, 2018
License type: cc-by

