Abstract

Self‐caring IT systems are those that can proactively avoid system failures rather than reactively handle failures after they have occurred. In this paper, we focus on failures in which a MapReduce job is unable to execute within a service‐level‐agreement‐based completion time. The existing fault‐tolerance capability provided by MapReduce frameworks such as Hadoop is simple, and the penalty associated with handling faults could potentially lead to excessive job execution times. Our goal in this paper is to bring out the severity of this penalty for different job and framework parameters. We quantitatively evaluate the penalty in execution time associated with node faults using the MRPerf simulator. We then perform an empirical study of penalties on a virtualized testbed consisting of Xen domains, varying system characteristics along four dimensions: hardware, application, dataset, and fault type. Through simulation and empirical results, we show that job‐completion‐time service‐level agreement violations can be reduced using dynamic resource scaling. Scaling leverages the elastic properties of a virtualized environment to mitigate execution time penalties and hence proactively avoids a potential job failure. We show that using resource scaling, performance penalties can be decreased to less than 5% of the no‐fault execution time, at minimal additional cost. Copyright © 2013 John Wiley & Sons, Ltd.
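The scaling decision described above can be sketched as a simple SLA monitor: project the job's completion time from its current progress and trigger scaling when the projection threatens the deadline. This is an illustrative sketch only, not the authors' implementation; the function names, the linear-progress model, and the 5% slack margin (echoing the penalty target in the abstract) are all assumptions.

```python
# Illustrative sketch of an SLA-aware scaling trigger for a MapReduce job.
# Assumptions: job progress is a fraction in (0, 1], and future throughput
# matches the average throughput observed so far (a linear-progress model).

def projected_completion(elapsed_s: float, progress: float) -> float:
    """Project total runtime from elapsed time and fraction of work done."""
    if progress <= 0:
        return float("inf")  # no progress yet: cannot project, assume worst case
    return elapsed_s / progress

def should_scale(elapsed_s: float, progress: float,
                 sla_deadline_s: float, slack: float = 0.05) -> bool:
    """Trigger resource scaling when the projected completion time exceeds
    the SLA deadline minus a slack margin (here 5%, an assumed value)."""
    return projected_completion(elapsed_s, progress) > sla_deadline_s * (1 - slack)
```

For example, a job that is 40% done after 300 s projects to 750 s; against a 700 s SLA deadline the monitor would request scaling, whereas a healthy job 50% done at the same point (projecting 600 s) would not.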


