Abstract
Training Deep Learning (DL) models is becoming increasingly time-consuming, so interruptions to the training process are inevitable. For an HPC (High-Performance Computing) job, an optimal checkpointing interval that minimizes fault-tolerance overhead can be derived under the precondition that job progress is proportional to execution time. Unfortunately, this does not hold for DL model training, where a training job yields diminishing returns over its lifetime. Moreover, training DL models is inherently exploratory, and early termination occurs frequently during model development, which makes the early progress of a DL training job more valuable than the later progress. Evenly placed checkpoints would therefore either increase risk in the early stages or waste resources overprotecting the later stages. In addition, under data parallelism, state-of-the-art quality-driven scheduling strategies allocate more resources to the early stages of a job than to the later ones to accelerate training, which further amplifies this imbalance. In summary, the early stages matter more than the later stages, and allocating more fault-tolerance resources to them benefits model exploration. Based on this observation, we present COCI, an approach that computes the optimal checkpointing configuration for an exploratory DL training job, minimizing fault-tolerance overhead, including both checkpoint cost and recovery cost. We implement COCI on top of a state-of-the-art iteration-level checkpointing mechanism as a pluggable module compatible with PyTorch, requiring no extra user input. Experimental results show that COCI reduces fault-tolerance overhead by up to 40.18% compared to existing state-of-the-art DL fault-tolerance methods in serial scenarios, and by up to 60.64% in data-parallel scenarios.
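The classical result the abstract contrasts against is the optimal checkpointing interval for jobs whose progress is proportional to execution time, commonly given by Young's approximation. As a minimal sketch (not COCI's method, which drops the uniform-progress assumption), the interval depends only on the per-checkpoint cost and the mean time between failures (MTBF):

```python
import math

def optimal_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's approximation for the optimal checkpoint interval.

    checkpoint_cost_s: time to write one checkpoint (seconds).
    mtbf_s: mean time between failures (seconds).
    Returns the interval between checkpoints (seconds) that
    approximately minimizes fault-tolerance overhead, assuming
    job progress is proportional to execution time.
    """
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

# Illustrative (hypothetical) numbers: a 60 s checkpoint and a 24 h MTBF
# yield an interval of roughly 54 minutes between checkpoints.
interval = optimal_interval(60.0, 24 * 3600)
```

Because a DL training job's value is front-loaded, a single fixed interval like this either under-protects the valuable early iterations or over-protects the cheap later ones, which is the gap COCI targets.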