Abstract

Self‐caring IT systems are those that can proactively avoid system failures rather than reactively handle failures after they have occurred. In this paper, we focus on failures in which a MapReduce job is unable to execute within a service‐level‐agreement‐based completion time. The existing fault‐tolerance capability provided by MapReduce frameworks such as Hadoop is simple, and the penalty associated with handling faults could potentially lead to excessive job execution times. Our goal in this paper is to bring out the severity of this penalty for different job and framework parameters. We quantitatively evaluate the penalty in execution time associated with node faults using the MRPerf simulator. We then perform an empirical study of penalties on a virtualized testbed consisting of Xen domains, varying system characteristics along four dimensions: hardware, application, dataset, and fault type. Through simulation and empirical results, we show that job‐completion‐time service‐level agreement violations can be reduced using dynamic resource scaling. Scaling leverages the elastic properties of a virtualized environment to mitigate execution time penalties and hence proactively avoids a potential job failure. We show that using resource scaling, performance penalties can be decreased to less than 5% of the no‐fault execution time, at minimal additional cost. Copyright © 2013 John Wiley & Sons, Ltd.
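The scaling decision described above can be sketched as a simple SLA monitor: project the job's completion time from its current progress and trigger scaling when the projection threatens the deadline. This is an illustrative sketch only, not the authors' implementation; the function names, the linear-progress model, and the 5% slack margin (echoing the penalty target in the abstract) are all assumptions.

```python
# Illustrative sketch of an SLA-aware scaling trigger for a MapReduce job.
# Assumptions: job progress is a fraction in (0, 1], and future throughput
# matches the average throughput observed so far (a linear-progress model).

def projected_completion(elapsed_s: float, progress: float) -> float:
    """Project total runtime from elapsed time and fraction of work done."""
    if progress <= 0:
        return float("inf")  # no progress yet: cannot project, assume worst case
    return elapsed_s / progress

def should_scale(elapsed_s: float, progress: float,
                 sla_deadline_s: float, slack: float = 0.05) -> bool:
    """Trigger resource scaling when the projected completion time exceeds
    the SLA deadline minus a slack margin (here 5%, an assumed value)."""
    return projected_completion(elapsed_s, progress) > sla_deadline_s * (1 - slack)
```

For example, a job that is 40% done after 300 s projects to 750 s; against a 700 s SLA deadline the monitor would request scaling, whereas a healthy job 50% done at the same point (projecting 600 s) would not.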


