Understanding Checkpointing Overheads on Massive-Scale Systems: Analysis of the IBM Blue Gene/P System

Rinku Gupta, Harish Naik, Pete Beckman

doi:10.1177/1094342010369118

Abstract

Providing fault tolerance in high-end petascale systems, consisting of millions of hardware components and complex software stacks, is becoming an increasingly challenging task. Checkpointing continues to be the most prevalent technique for providing fault tolerance in such high-end systems. Considerable research has focused on optimizing checkpointing; however, in practice, checkpointing still imposes a high overhead on users. In this paper, we study the checkpointing overhead seen by various applications running on leadership-class machines like the IBM Blue Gene/P at Argonne National Laboratory. In addition to studying popular applications, we design a methodology to help users understand and intelligently choose an optimal checkpointing frequency to reduce the overall checkpointing overhead incurred. In particular, we study four applications: the Grid-Based Projector-Augmented Wave application, the Car-Parrinello Molecular Dynamics application, the Nek5000 computational fluid dynamics application, and the Parallel Ocean Program application. We analyze their memory usage and possible checkpointing trends on 65,536 processors of the Blue Gene/P system.
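The abstract does not describe the paper's methodology for choosing a checkpointing frequency. As a rough illustration of the trade-off it refers to, the sketch below uses Young's well-known first-order approximation, which is a swapped-in textbook model and not the authors' method. The function names and the example numbers (a 50-second checkpoint, a 10,000-second mean time between failures) are hypothetical.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Compute time between checkpoints under Young's first-order model.

    Assumption: failures arrive at a constant rate with mean time between
    failures mtbf_s, and the checkpoint cost is small relative to mtbf_s.
    This is a generic model, not the methodology of the paper above.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def checkpoint_overhead_fraction(checkpoint_cost_s: float, interval_s: float) -> float:
    """Fraction of failure-free wall-clock time spent writing checkpoints."""
    if checkpoint_cost_s < 0 or interval_s <= 0:
        raise ValueError("cost must be non-negative and interval positive")
    return checkpoint_cost_s / (interval_s + checkpoint_cost_s)


if __name__ == "__main__":
    cost = 50.0      # hypothetical time to write one checkpoint, in seconds
    mtbf = 10_000.0  # hypothetical system mean time between failures, in seconds
    interval = young_interval(cost, mtbf)
    print(f"suggested interval: {interval:.0f} s")
    print(f"failure-free overhead: {checkpoint_overhead_fraction(cost, interval):.1%}")
```

Under this model, checkpointing more often than the suggested interval spends more time writing checkpoints, while checkpointing less often increases the work lost when a failure occurs.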

Journal: The International Journal of High Performance Computing Applications
Publication Date: Jun 3, 2010
Citations: 12

Similar Papers

Analyzing Checkpointing Trends for Applications on the IBM Blue Gene/P System
Harish Gapanati Naik ... Pete Beckman
01 Sep 2009

Overlapping Computations with Communications and I/O Explicitly Using OpenMP Based Heterogeneous Threading Models
Sadaf R Alam ...
01 Jan 2012

Experimental Assessment of the Practicality of a Fault-Tolerant System
Jai Wug Kim ... Heon Y Yeom
01 Jan 2007

Towards Development of Risk-based Checkpointing Scheme Via Parametric Bootstrapping
Shunsuke Tokumoto ... Wong Young Yun
01 Nov 2012
