Abstract

The MapReduce programming paradigm is frequently used to process and analyze huge amounts of data. This paradigm relies on the ability to apply the same operation in parallel to independent chunks of data. As a consequence, overall performance greatly depends on the way data are partitioned among the various computation nodes. The default partitioning technique, provided by systems like Hadoop or Spark, essentially performs a random subdivision of the input records, without considering their nature or the correlations between them. While such an approach can be appropriate in the simplest case, where all the input records always have to be analyzed, it becomes a limitation for more sophisticated analyses, in which correlations between records can be exploited to prune unnecessary computations in advance. In this paper we design a context-based multi-dimensional partitioning technique, called CoPart, which accounts for data correlation in order to determine how records are subdivided between splits (i.e., units of work assigned to a computation node). More specifically, it considers not only the correlation of data w.r.t. contextual attributes, but also the distribution of each contextual dimension in the dataset. We experimentally compare our approach with existing ones, considering both quality criteria and query execution times.
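
To make the contrast concrete, the following is a minimal sketch, not the CoPart algorithm itself: it compares a content-agnostic subdivision (the default behavior described above) with a context-based one that buckets records by a uniform grid over their contextual dimensions, so that correlated records land in the same split. The attribute names, grid resolution, and helper functions are illustrative assumptions.

```python
from collections import defaultdict

def default_partition(records, n_splits):
    """Content-agnostic subdivision: records are spread round-robin,
    ignoring any correlation between them (as in the default
    partitioners of systems like Hadoop or Spark)."""
    splits = defaultdict(list)
    for i, rec in enumerate(records):
        splits[i % n_splits].append(rec)
    return splits

def context_partition(records, dims, cells_per_dim):
    """Context-based subdivision: bucket records by a uniform grid over
    their contextual dimensions, so correlated records share a split."""
    # Per-dimension value ranges, estimated from the data itself.
    lo = {d: min(r[d] for r in records) for d in dims}
    hi = {d: max(r[d] for r in records) for d in dims}
    splits = defaultdict(list)
    for rec in records:
        cell = tuple(
            min(int((rec[d] - lo[d]) / ((hi[d] - lo[d]) or 1) * cells_per_dim),
                cells_per_dim - 1)
            for d in dims
        )
        splits[cell].append(rec)
    return splits

# Hypothetical records with two contextual attributes.
records = [{"time": t, "temp": 20 + (t % 7)} for t in range(100)]
grid = context_partition(records, dims=("time", "temp"), cells_per_dim=4)
```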

Highlights

  • The need to process and analyze the overwhelming flow of data, due to the rise of social media, the Internet of Things (IoT) and multimedia, has motivated the study and development of parallel data processing systems able to deal with it

  • Performance metrics: we present a set of quality metrics that may be used for evaluating and comparing different context-based partitioning techniques

  • Q3 is the total margin, so the greater the value, the better the index: in all cases the CoPart technique achieves values that are an order of magnitude better (an illustrative computation of such a margin is sketched after this list)

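The exact definitions of the paper's quality metrics are not reproduced in this excerpt. As an illustrative sketch only, assuming the "total margin" of Q3 aggregates the margins (perimeters) of the minimum bounding rectangles of the splits over the contextual dimensions, it could be computed as follows; the function names and sample data are hypothetical.

```python
def mbr(points):
    """Minimum bounding rectangle of a list of d-dimensional points."""
    d = len(points[0])
    lo = [min(p[i] for p in points) for i in range(d)]
    hi = [max(p[i] for p in points) for i in range(d)]
    return lo, hi

def total_margin(splits):
    """Sum, over all splits, of the margin (perimeter) of the split's MBR."""
    total = 0.0
    for points in splits:
        lo, hi = mbr(points)
        total += 2 * sum(h - l for l, h in zip(lo, hi))
    return total

# Two hypothetical splits, each a list of 2-dimensional contextual points.
splits = [[(0, 0), (1, 2)], [(5, 5), (7, 9)]]
print(total_margin(splits))  # perimeter of each split's MBR, summed
```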

Introduction

The need to process and analyze the overwhelming flow of data, due to the rise of social media, the Internet of Things (IoT) and multimedia, has motivated the study and development of parallel data processing systems able to deal with it. The main underlying assumption is that splits can be processed in parallel to produce partial results, which are progressively combined until the final result is obtained. Such an approach was originally developed for bulk analyses, i.e., analyses that involve all the records, under the assumption that the processing time is approximately the same for each record. This translates into a default partitioning technique that considers only the amount of data assigned to each split, without taking into account the nature of the records or the correlations between them.
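
A minimal sketch of this split-parallel model follows, using a word-count task as a stand-in for the analysis: each split is processed independently to produce a partial result, and the partial results are then combined into the final one. The helper names and the use of Python's process pool are illustrative assumptions, not the systems discussed in the paper.

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor

def process_split(split):
    """Produce a partial result (here, word counts) for one split."""
    counts = Counter()
    for record in split:
        counts.update(record.split())
    return counts

def combine(partials):
    """Merge the partial results into the final result."""
    final = Counter()
    for p in partials:
        final.update(p)
    return final

if __name__ == "__main__":
    splits = [["a b a"], ["b c"], ["a c c"]]
    with ProcessPoolExecutor() as pool:
        print(combine(pool.map(process_split, splits)))
```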
