SharesSkew: An algorithm to handle skew for joins in MapReduce

Foto N Afrati, Nikos Stasinopoulos, Jeffrey D Ullman, Angelos Vassilakopoulos

doi:10.1016/j.is.2018.06.005

Open Access

https://doi.org/10.1016/j.is.2018.06.005

Abstract

In this paper we offer an algorithm that computes multiway joins efficiently in MapReduce even when the data is skewed. Handling skew is one of the major challenges in query processing, and computing joins is both important and costly. When data is huge, distributed computational platforms must be used. The Shares algorithm for computing multiway joins in MapReduce has been shown to be efficient in various scenarios. It optimizes the communication cost, which is the amount of data transferred from the mappers to the reducers. However, it does not handle skew. Our algorithm distributes records with Heavy Hitter (HH) values separately, using an adaptation of the Shares algorithm to achieve minimum communication cost. Which values of an attribute count as HHs is decided by our algorithm and depends on the sizes of the relations (or of the parts of the relations with HHs) and on how these sizes relate to each other. Unlike other recent algorithms for computing multiway joins in MapReduce, which put a constraint on the number of reducers used, our algorithm puts a constraint on the size (number of tuples) of each reducer. We argue that this choice results in an even distribution of the data to the reducers. Furthermore, we investigate a family of multiway joins for which a simpler variant of our algorithm can handle skew. We offer closed forms for computing the parameters of our algorithm for chain and symmetric joins.
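
The abstract gives no implementation details, so the following is only a minimal sketch of the plain Shares idea it builds on, not of SharesSkew itself. It assumes a three-way chain join R(A,B) ⋈ S(B,C) ⋈ T(C,D) and a b × c grid of reducers; the function names, the modulo hash functions, and the toy data are all hypothetical. The sketch shows how the communication cost arises from replication and how a single frequent value of B concentrates tuples on a few reducers, which is the skew problem the abstract refers to.

```python
from collections import defaultdict


def shares_route(R, S, T, b, c):
    """Route tuples of R(A,B), S(B,C), T(C,D) to a b x c grid of reducers.

    Assumption: attribute B gets share b and attribute C gets share c,
    with simple modulo hashing standing in for real hash functions.
    """
    h = lambda v: hash(v) % b  # bucket of a B-value
    g = lambda v: hash(v) % c  # bucket of a C-value
    reducers = defaultdict(lambda: {"R": [], "S": [], "T": []})
    for a, bv in R:  # R has no C attribute: replicate over all C buckets
        for j in range(c):
            reducers[(h(bv), j)]["R"].append((a, bv))
    for bv, cv in S:  # S has both B and C: sent to exactly one reducer
        reducers[(h(bv), g(cv))]["S"].append((bv, cv))
    for cv, d in T:  # T has no B attribute: replicate over all B buckets
        for i in range(b):
            reducers[(i, g(cv))]["T"].append((cv, d))
    return reducers


def reducer_load(parts):
    """Number of tuples received by one reducer."""
    return sum(len(tuples) for tuples in parts.values())


def communication_cost(reducers):
    """Total number of tuples shipped from mappers to reducers."""
    return sum(reducer_load(parts) for parts in reducers.values())


def local_join(parts):
    """Join the fragments that arrived at a single reducer."""
    return [
        (a, bv, cv, d)
        for (a, bv) in parts["R"]
        for (b2, cv) in parts["S"] if b2 == bv
        for (c2, d) in parts["T"] if c2 == cv
    ]


# Toy data (hypothetical): the B-value 0 is a heavy hitter in R.
R = [(i, 0) for i in range(6)] + [(6, 1)]
S = [(0, 0), (1, 1)]
T = [(0, 10), (1, 11)]

reducers = shares_route(R, S, T, b=2, c=2)
result = sorted(t for parts in reducers.values() for t in local_join(parts))
loads = {key: reducer_load(parts) for key, parts in reducers.items()}
print("communication cost:", communication_cost(reducers))
print("per-reducer load:", loads)
```

In this sketch the cost equals |S| + c·|R| + b·|T|, and the reducers whose B bucket contains the heavy hitter receive most of the tuples. Bounding the number of tuples per reducer, as the abstract describes, targets exactly this imbalance.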

Journal: Information Systems
Publication Date: Jun 14, 2018
Citations: 18

Similar Papers

Optimal Tracking of Distributed Heavy Hitters and Quantiles
Ke Yi ... Qin Zhang
Algorithmica | VOL. 65
21 Oct 2011

Optimal tracking of distributed heavy hitters and quantiles
Ke Yi ... Qin Zhang
29 Jun 2009

Query optimization in star computer networks
Larry Kerschberg ... S Bing Yao
ACM Transactions on Database Systems | VOL. 7
01 Dec 1982

Communication-efficient algorithms for tracking distributed data streams
Qin Zhang
23 Dec 2014
