Abstract

MapReduce, proposed as a programming model, has been widely adopted for large-scale data processing owing to its ability to exploit distributed resources. Nevertheless, this success is accompanied by the difficulty of fitting applications into MapReduce, because MapReduce is limited to one kind of fine-grained parallelism: every input key-value pair is processed independently. In this paper, we extend MapReduce to support coarse-grained parallelism within applications. More specifically, we generalize the applicability of a single MapReduce job by allowing dependence among the key-value pairs within a set, while preserving independence across sets. This generalization, however, raises an intricate problem: how can the two-stage processing structure inherent in MapReduce handle the dependence within a set of input key-value pairs? To tackle this problem, we propose a design pattern called two-phase data processing. It expresses the application in two phases, not only to match the two-stage processing structure but also to exploit the power of MapReduce through cooperation between the mappers and reducers. To enable MapReduce to exploit coarse-grained parallelism, we present a design methodology that offers guidance on the granularity of parallelism, evaluation of applying the design pattern, and analysis of dependence. We report two experiments. The first, conducted on GPS records of public transit, demonstrates how to fuse a Big Data application with its data preprocessing into one MapReduce job. The second turns to computer vision and takes background subtraction, a component of video surveillance, to show that our generalization broadens the feasibility of MapReduce.
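To make the generalization concrete, the following is a minimal, hypothetical sketch (not the authors' implementation) of the two-phase idea in a simulated single-machine MapReduce. The map phase tags each record with a set key and emits pairs independently (fine-grained parallelism); the shuffle groups pairs by set key; the reduce phase then processes each whole set with internal dependence (here, a running cumulative sum over the set's values), while distinct sets remain independent and could run on separate reducers. The set key `"bus-1"`, the sample records, and the cumulative-sum dependence are illustrative assumptions.

```python
from collections import defaultdict

def map_phase(records):
    """Phase 1: each mapper tags a record with its set key (e.g. a
    vehicle ID), emitting key-value pairs independently of one another."""
    for set_key, value in records:
        yield set_key, value

def reduce_phase(set_key, values):
    """Phase 2: one reducer receives an entire set and may process its
    values with internal dependence -- here, a running cumulative sum,
    where each output depends on all values seen before it."""
    total, out = 0, []
    for v in values:
        total += v
        out.append(total)
    return set_key, out

def run_job(records):
    # Shuffle: group intermediate pairs by set key, as MapReduce does
    # between its two stages.
    groups = defaultdict(list)
    for key, value in map_phase(records):
        groups[key].append(value)
    # Sets are mutually independent, so the reducers below could run in
    # parallel; dependence exists only inside each call to reduce_phase.
    return dict(reduce_phase(k, vs) for k, vs in groups.items())

# Hypothetical GPS-like records: (vehicle ID, reading).
records = [("bus-1", 3), ("bus-2", 5), ("bus-1", 4), ("bus-2", 1)]
print(run_job(records))  # {'bus-1': [3, 7], 'bus-2': [5, 6]}
```

The key point of the pattern is the division of labor: mappers do the dependence-free work of routing each pair to its set, and each reducer absorbs all of the dependence for exactly one set.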
