Abstract

Deep learning (DL) is becoming one of the most important classes of applications for HPC and cloud systems. The massive datasets and deep neural networks (DNNs) used by DL applications introduce many HPC challenges, and the resulting long-running training jobs must survive system failures, which makes HPC checkpoint/restart tools an attractive choice. However, most data-parallel DL training jobs use a naive scheme, called root checkpointing, which blocks all ranks while a single root process writes the checkpoint and thereby stalls forward progress. In this work, we apply a multi-level checkpointing tool (SCR-Exa) to distributed DL applications. We examine the performance of two DNN models at scale on Lassen (a leading TOP500 system), while verifying that the DNNs' accuracy is maintained after restarting from simulated system failures. Our results show that multi-level checkpointing schemes achieve nearly constant overhead at scale. To the best of our knowledge, this study presents the first evaluation to demonstrate strong scalability of a checkpointing scheme for distributed DL without making framework-specific changes.
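For context, the sketch below illustrates the root checkpointing scheme the abstract contrasts against: in a data-parallel job, only rank 0 writes the model state, and every rank blocks until that write completes. The use of PyTorch distributed training and the checkpoint path are illustrative assumptions; the paper does not tie its evaluation to a particular framework.

import torch
import torch.distributed as dist

def root_checkpoint(model, optimizer, step, path="/p/gpfs1/ckpt.pt"):
    """Naive root checkpointing: rank 0 alone serializes the job state.

    All ranks synchronize before and after the write, so no rank makes
    forward progress while the (possibly large) checkpoint is flushed to
    the parallel file system -- the blocking behavior the paper targets.
    The path is a hypothetical example location.
    """
    dist.barrier()                      # blocking semantics: every rank waits
    if dist.get_rank() == 0:
        torch.save(
            {
                "step": step,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
            },
            path,
        )
    dist.barrier()                      # no rank resumes training early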
