Data Balance Algorithm Based on Histogram in MapReduce

Wenyi Zhou, Yong Zhong, Yan Wang

doi:10.1051/jnwpu/20183630480

Abstract

MapReduce is a typical distributed computing model that is widely used in large-scale data processing, and its performance depends largely on how the data are distributed. Because data content is often unbalanced and blocks are stored at random, MapReduce is prone to data skew during computation. To address this problem, this paper builds data histograms for each data block and for the whole file using an improved MapReduce-based parallel histogram construction algorithm. From the data block distribution, we judge the degree of data skew on each storage node and define the file equilibrium deviation value as the measure of data skew; a data balance algorithm is then used to reduce this value. The improved histogram construction algorithm adapts to various types of data and application scenarios. While the histogram is being built, the Map side transmits only histogram statistics to the Reduce side, not the contents of the file, so the volume of data transferred is almost negligible. The histogram-based data balance algorithm employs a greedy strategy, which yields a good approximation to the optimal balanced distribution. In repeated experiments, the improved algorithm reduced the file equilibrium deviation value by about 40% compared with the random block distribution algorithm, achieving better data balance.
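The two ideas in the abstract, histogram statistics flowing from Map to Reduce and a greedy placement that lowers a deviation value, can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract does not define the file equilibrium deviation value, so the L1 distance between a node's histogram and its equal share of the whole-file histogram is an assumption, and the names `map_histogram`, `reduce_histograms`, `node_deviation`, and `greedy_balance` are hypothetical.

```python
from collections import Counter

def map_histogram(block, bucket_width):
    """Map side: summarize one block of integers as bucket -> count."""
    return Counter(value // bucket_width for value in block)

def reduce_histograms(partials):
    """Reduce side: merge per-block histograms into a whole-file histogram."""
    total = Counter()
    for hist in partials:
        total.update(hist)
    return total

def node_deviation(node_hist, file_hist, node_count):
    """Assumed skew measure: L1 distance from an equal share of the file."""
    return sum(abs(node_hist[b] - file_hist[b] / node_count) for b in file_hist)

def greedy_balance(block_hists, node_count):
    """Place each block on the node whose deviation improves the most."""
    file_hist = reduce_histograms(block_hists)
    node_hists = [Counter() for _ in range(node_count)]
    placement = []
    for hist in block_hists:
        def change(n):
            before = node_deviation(node_hists[n], file_hist, node_count)
            after = node_deviation(node_hists[n] + hist, file_hist, node_count)
            return after - before
        best = min(range(node_count), key=change)  # ties go to the lowest index
        node_hists[best].update(hist)
        placement.append(best)
    total = sum(node_deviation(h, file_hist, node_count) for h in node_hists)
    return placement, total

blocks = [[1, 2, 3, 4], [1, 2, 3, 4], [11, 12, 13, 14], [11, 12, 13, 14]]
block_hists = [map_histogram(b, bucket_width=10) for b in blocks]
placement, total = greedy_balance(block_hists, node_count=2)
print(placement, total)  # [0, 1, 0, 1] 0.0
```

Only the small `Counter` objects cross from the Map side to the Reduce side here, which mirrors the claim that the file contents themselves never need to be transmitted. Like any greedy heuristic, this placement gives an approximate rather than a guaranteed optimal result.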

Highlights

Reviewing the Big Data Solution Based on Hadoop Ecosystem[J]
Research on Handling Data Skew in MapReduce[J]

Summary

Introduction

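Keywords: histogram; parallel algorithm; data skew; data block; data balance; constrained optimization; experimental design

CLC number: TP302.8  Document code: A  Article ID: 1000-2758(2018)03-0480-07

Two performance bottlenecks can arise during a MapReduce computation: the parallel computation time of the Map phase is bounded by the most heavily loaded Map task, and the computation time of the Reduce phase is bounded by the most heavily loaded Reduce task. Reference [9] analyzes the causes. In the Map phase, the original data are unevenly distributed, so one (or several) Map tasks process far more data than the others. In the Reduce phase, skew in the data content itself, together with the unsuitability of Hadoop's default hash partitioning method, causes one (or several) Reduce tasks to process far more data than the other Reduce tasks. Based on this state of research, this paper uses an improved MapReduce-based parallel construction algorithm for data histograms to measure and identify data skew in the Map phase, and on that basis applies a data balance algorithm to mitigate the skew.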
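The Reduce-phase effect described above can be reproduced with a small Python sketch. It is an illustration under stated assumptions: `zlib.crc32` stands in for the key's Java `hashCode()`, the key counts are made-up sample data, and the helper names `partition`, `reducer_loads`, and `skew_ratio` are hypothetical. The partition rule follows the shape of Hadoop's default hash partitioner, that is, a non-negative hash taken modulo the number of Reduce tasks.

```python
import zlib

def partition(key, num_reducers):
    """Non-negative hash modulo the reducer count (crc32 is a stand-in hash)."""
    return (zlib.crc32(key.encode("utf-8")) & 0x7FFFFFFF) % num_reducers

def reducer_loads(key_counts, num_reducers):
    """Total number of records each Reduce task receives."""
    loads = [0] * num_reducers
    for key, count in key_counts.items():
        loads[partition(key, num_reducers)] += count
    return loads

def skew_ratio(loads):
    """Heaviest task load divided by the mean load; 1.0 means perfectly even."""
    return max(loads) / (sum(loads) / len(loads))

# Made-up sample: one hot key dominates the data content.
key_counts = {"hot": 9000}
key_counts.update({f"key{i}": 100 for i in range(10)})

loads = reducer_loads(key_counts, num_reducers=4)
print(loads, round(skew_ratio(loads), 2))
```

All records that share a key go to the same Reduce task, so the task that receives the hot key carries at least 9000 of the 10000 records whatever the hash function is. The phase cannot finish before that task does.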

Results

Conclusion