Abstract

The MapReduce programming paradigm is frequently used to process and analyze huge amounts of data. This paradigm relies on the ability to apply the same operation in parallel to independent chunks of data. As a consequence, overall performance greatly depends on the way data are partitioned among the various computation nodes. The default partitioning technique, provided by systems like Hadoop or Spark, essentially performs a random subdivision of the input records, without considering their nature or the correlations between them. While such an approach can be appropriate in the simplest case, where all the input records always have to be analyzed, it becomes a limitation for more sophisticated analyses, in which correlations between records can be exploited to prune unnecessary computations in advance. In this paper we design a context-based multi-dimensional partitioning technique, called CoPart, which accounts for data correlation in order to determine how records are subdivided between splits (i.e., units of work assigned to a computation node). More specifically, it considers not only the correlation of data w.r.t. contextual attributes, but also the distribution of each contextual dimension in the dataset. We experimentally compare our approach with existing ones, considering both quality criteria and query execution times.
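
To make the contrast concrete, the following is a minimal sketch, not the CoPart algorithm itself: it compares a content-agnostic subdivision (the default behavior described above) with a context-based one that buckets records by a uniform grid over their contextual dimensions, so that correlated records land in the same split. The attribute names, grid resolution, and helper functions are illustrative assumptions.

```python
from collections import defaultdict

def default_partition(records, n_splits):
    """Content-agnostic subdivision: records are spread round-robin,
    ignoring any correlation between them (as in the default
    partitioners of systems like Hadoop or Spark)."""
    splits = defaultdict(list)
    for i, rec in enumerate(records):
        splits[i % n_splits].append(rec)
    return splits

def context_partition(records, dims, cells_per_dim):
    """Context-based subdivision: bucket records by a uniform grid over
    their contextual dimensions, so correlated records share a split."""
    # Per-dimension value ranges, estimated from the data itself.
    lo = {d: min(r[d] for r in records) for d in dims}
    hi = {d: max(r[d] for r in records) for d in dims}
    splits = defaultdict(list)
    for rec in records:
        cell = tuple(
            min(int((rec[d] - lo[d]) / ((hi[d] - lo[d]) or 1) * cells_per_dim),
                cells_per_dim - 1)
            for d in dims
        )
        splits[cell].append(rec)
    return splits

# Hypothetical records with two contextual attributes.
records = [{"time": t, "temp": 20 + (t % 7)} for t in range(100)]
grid = context_partition(records, dims=("time", "temp"), cells_per_dim=4)
```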

Highlights

  • The need to process and analyze the overwhelming flow of data, due to the rise of social media, the Internet of Things (IoT) and multimedia, has motivated the study and development of parallel data processing systems able to deal with it

  • Performance metrics: we present a set of quality metrics that may be used for evaluating and comparing different context-based partitioning techniques

  • Q3 is the total margin, so the greater the value, the better the index: in all cases the CoPart technique achieves values that are an order of magnitude better (an illustrative computation of such a margin is sketched after this list)

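The exact definitions of the paper's quality metrics are not reproduced in this excerpt. As an illustrative sketch only, assuming the "total margin" of Q3 aggregates the margins (perimeters) of the minimum bounding rectangles of the splits over the contextual dimensions, it could be computed as follows; the function names and sample data are hypothetical.

```python
def mbr(points):
    """Minimum bounding rectangle of a list of d-dimensional points."""
    d = len(points[0])
    lo = [min(p[i] for p in points) for i in range(d)]
    hi = [max(p[i] for p in points) for i in range(d)]
    return lo, hi

def total_margin(splits):
    """Sum, over all splits, of the margin (perimeter) of the split's MBR."""
    total = 0.0
    for points in splits:
        lo, hi = mbr(points)
        total += 2 * sum(h - l for l, h in zip(lo, hi))
    return total

# Two hypothetical splits, each a list of 2-dimensional contextual points.
splits = [[(0, 0), (1, 2)], [(5, 5), (7, 9)]]
print(total_margin(splits))  # perimeter of each split's MBR, summed
```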

Introduction

The need to process and analyze the overwhelming flow of data, due to the rise of social media, the Internet of Things (IoT) and multimedia, has motivated the study and development of parallel data processing systems able to deal with it. The main underlying assumption is that splits can be processed in parallel to produce partial results, which are progressively combined until the final result is obtained. Such an approach was originally developed for bulk analyses, i.e., analyses that involve all the records, under the assumption that the processing time is approximately the same for each record. This translates into a default partitioning technique that considers only the amount of data assigned to each split, without taking into account the nature of the records or the correlations between them.
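
A minimal sketch of this split-parallel model follows, using a word-count task as a stand-in for the analysis: each split is processed independently to produce a partial result, and the partial results are then combined into the final one. The helper names and the use of Python's process pool are illustrative assumptions, not the systems discussed in the paper.

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor

def process_split(split):
    """Produce a partial result (here, word counts) for one split."""
    counts = Counter()
    for record in split:
        counts.update(record.split())
    return counts

def combine(partials):
    """Merge the partial results into the final result."""
    final = Counter()
    for p in partials:
        final.update(p)
    return final

if __name__ == "__main__":
    splits = [["a b a"], ["b c"], ["a c c"]]
    with ProcessPoolExecutor() as pool:
        print(combine(pool.map(process_split, splits)))
```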
