Performance Implications of Periodic Checkpointing on Large-Scale Cluster Systems

A.J. Oliner, R.K. Sahoo, M. Gupta, J.E. Moreira

doi:10.1109/ipdps.2005.337

Abstract

Large-scale systems like BlueGene/L are susceptible to a number of software and hardware failures that can affect system performance. Periodic application checkpointing is a common technique for mitigating the amount of work lost due to job failures, but its effectiveness under realistic circumstances has not been studied. In this paper, we analyze the system-level performance of periodic application checkpointing using parameters similar to those projected for BlueGene/L systems. Our results reflect simulations on a toroidal interconnect architecture, using a real job log from a machine similar to BlueGene/L, and with a real failure distribution from a large-scale cluster. Our simulation studies investigate the impact of parameters such as checkpoint overhead and checkpoint interval on a number of performance metrics, including bounded slowdown, system utilization, and total work lost. The results suggest that periodic checkpointing may not be an effective way to improve the average bounded slowdown or average system utilization metrics, though it reduces the amount of work lost due to failures. We show that overzealous checkpointing with high overhead can amplify the effects of failures. The study also suggests that new metrics and checkpointing techniques may be required to effectively handle job failures on large-scale machines like BlueGene/L.
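
The trade-off the abstract describes, between checkpoint overhead, checkpoint interval, and work lost, can be illustrated with a minimal single-job sketch. This is not the authors' simulator: the function name `run_with_checkpoints`, its parameters, and the simplifications (instant restart, no queueing, no checkpoint after the final segment, a failure during a checkpoint write discards that segment) are assumptions made here for illustration only.

```python
def run_with_checkpoints(work, interval, overhead, failure_times):
    """Simulate one job under periodic checkpointing (illustrative only).

    work          -- seconds of useful computation the job needs
    interval      -- seconds of useful work between checkpoints
    overhead      -- seconds spent writing each checkpoint
    failure_times -- wall-clock times (seconds) at which the job fails

    Returns (finish_time, work_lost). Assumes restart is instantaneous
    and that the job resumes from its most recent completed checkpoint.
    """
    clock = 0.0   # current wall-clock time
    saved = 0.0   # useful work protected by a completed checkpoint
    lost = 0.0    # useful work that had to be redone
    failures = sorted(failure_times)
    next_failure = 0

    while saved < work:
        segment = min(interval, work - saved)
        # Assumption: the final segment is not followed by a checkpoint.
        is_last = saved + segment >= work
        end = clock + segment + (0.0 if is_last else overhead)

        if next_failure < len(failures) and failures[next_failure] < end:
            # Failure before the checkpoint completed: everything computed
            # since the last checkpoint is lost and must be redone.
            fail_at = failures[next_failure]
            lost += min(fail_at - clock, segment)
            clock = fail_at
            next_failure += 1
        else:
            saved += segment
            clock = end

    return clock, lost


# Same job and failure, with and without checkpointing.
print(run_with_checkpoints(100, 25, 5, [40]))    # checkpoint every 25 s
print(run_with_checkpoints(100, 100, 0, [40]))   # no checkpoints at all
print(run_with_checkpoints(100, 10, 20, []))     # frequent, costly checkpoints
```

In this toy setting, checkpointing reduces the work lost to the failure, while a short interval combined with a high overhead lengthens the run even when no failure occurs. The paper's system-level metrics (bounded slowdown, utilization) additionally depend on scheduling and queueing effects that this sketch does not model.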

Similar Papers

Minimizing Overheads of Checkpoints in Distributed Stream Processing Systems
Syed Muhammad Abrar Akber ... Hai Jin
01 Oct 2018

Fault-aware job scheduling for BlueGene/L systems
A.J Oliner ... R.K Sahoo
26 Apr 2004

Resource management in enterprise cluster and storage systems
Jianzhe Tai
10 May 2021

Facilitating analysis of Monte Carlo dense matrix inversion algorithm scaling behaviour through simulation
Janko Straßburg ... Vassil N Alexandrov
Journal of Computational Science | VOL. 4
01 Feb 2013
