The scalability of future Massively Parallel Processing (MPP) systems is severely challenged by high failure rates. Current centralized Hard Disk Drive (HDD) checkpointing incurs an overhead of 25% or more at petascale. Because systems become more vulnerable as node counts keep increasing, novel techniques that enable fast and frequent checkpointing are critical to implementing future exascale systems. In this work, we first introduce an emerging nonvolatile memory technology, Phase-Change Random Access Memory (PCRAM), as a suitable candidate for a fast checkpointing device. After a thorough analysis of MPP systems, their failure rates, and their failure sources, we propose a PCRAM-based hybrid local/global checkpointing mechanism that not only provides faster checkpoint storage but also boosts the effectiveness of orthogonal techniques such as incremental checkpointing and background checkpointing. Three variant implementations of PCRAM-based hybrid checkpointing are designed for adoption at different stages, offering a smooth transition from conventional in-disk checkpointing to the instant in-memory approach. Using a hybrid checkpointing performance model to analyze the overhead, we show that the proposed approach incurs less than 3% performance overhead on a projected exascale system.