Abstract

Introduction: Rapidly growing volumes of information pose new challenges for modern data analysis technologies. For reasons of cost and performance, data processing is now usually carried out on cluster systems, and one of the most common operations in such analytics is the join of datasets. A join is an extremely expensive operation that is difficult to scale and optimize in distributed databases and in systems based on the MapReduce paradigm. Although much effort has gone into improving the performance of this operation, the proposed methods often either require fundamental changes to the MapReduce framework or merely reduce the operation's overhead, for example by balancing the network load. Objective: To develop an algorithm that accelerates the joining of datasets in distributed systems. Results: The Apache Spark architecture and the specifics of MapReduce-based distributed computing are reviewed, typical dataset join methods are analyzed, the main recommendations for optimizing join operations are given, and an algorithm, implemented in Apache Spark, that speeds up a special case of the join is presented. The algorithm partitions the input datasets and partially transfers them to the cluster's compute nodes so as to combine the advantages of the sort-merge join and the broadcast join. The experimental results show that the larger the input volume, the more effective the method: for 2 TB of compressed data, a speedup of up to ~37% over standard Spark SQL was obtained.
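The abstract does not spell out the algorithm, but the general idea of partially transferring a dataset so that a join can exploit both the broadcast and the sort-merge strategies can be illustrated with standard Spark SQL APIs. The Scala sketch below is a minimal reconstruction of that idea, not the paper's method: it splits a dimension table into a "hot" slice (keys that dominate the fact table), which is broadcast to every executor, and a "cold" remainder, which is joined with the usual sort-merge strategy. The input paths, the hot-key cutoff of 1000, and all names are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object PartialBroadcastJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partial-broadcast-join-sketch")
      .getOrCreate()

    // Hypothetical inputs: a large fact table and a dimension table
    // that is too big to broadcast in full.
    val facts = spark.read.parquet("/data/facts") // placeholder path
    val dims  = spark.read.parquet("/data/dims")  // placeholder path

    // 1. Find the "hot" join keys that dominate the fact table.
    //    The top-1000 cutoff is an illustrative parameter.
    val hotKeys = facts.groupBy("key").count()
      .orderBy(desc("count"))
      .limit(1000)
      .select("key")

    // 2. Split the dimension table: the hot slice must be small enough
    //    to ship to every executor; the cold remainder stays partitioned
    //    across the cluster.
    val hotDims  = dims.join(broadcast(hotKeys), "key")
    val coldDims = dims.join(broadcast(hotKeys), Seq("key"), "left_anti")

    // 3. Join each slice with the strategy it favors: a broadcast join
    //    for the hot slice, a sort-merge join for the rest.
    val hotJoined  = facts.join(broadcast(hotDims), "key")
    val coldJoined = facts.join(coldDims, "key") // sort-merge by default

    // 4. The union of the two partial inner joins equals the full join,
    //    since every fact row matches exactly one of the two slices.
    val result = hotJoined.unionByName(coldJoined)
    result.write.parquet("/data/joined")          // placeholder path

    spark.stop()
  }
}
```

By default, Spark SQL picks a broadcast join only when one side is below spark.sql.autoBroadcastJoinThreshold (10 MB unless configured otherwise) and otherwise falls back to a sort-merge join; the sketch forces a broadcast explicitly with the broadcast() hint for the hot slice only.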
