A Framework for Managing MapReduce Applications in Dynamic Distributed Environments

Fabrizio Marozzo, Domenico Talia, Paolo Trunfio

doi:10.1109/pdp.2011.41

Abstract

MapReduce is a programming model widely used in data centers for processing large data sets in a highly parallel way. Current MapReduce systems are based on master-slave architectures that do not cope well with dynamic node participation, since they are mostly designed for conventional parallel computing platforms. In contrast, in Internet-based computing environments, node churn and failures (including master failures) are likely to happen, since nodes join and leave the network at an unpredictable rate. The goal of this work is to enable the use of MapReduce in dynamic distributed environments, so as to combine the effectiveness of a well-established programming model with the scalability of a large-scale computing infrastructure. This paper presents an adaptive MapReduce framework, called P2P-MapReduce, which exploits a peer-to-peer model to manage intermittent node participation, master failures, and job recovery in a decentralized but effective way, providing a more robust MapReduce middleware for Internet-scale dynamic distributed environments.

Full Text