PITFALLS IN DISTRIBUTED NONBLOCKING CHECKPOINTING

Weigang Ni, Sibabrata Ray, Susan V. Vrbsky

doi:10.1142/s0219265904001027

Abstract

Coordinated checkpointing has low stable storage requirements and simplifies recovery by maintaining a set of consistent global checkpoints. Unfortunately, most of the algorithms proposed for it either incurred high communication overhead or blocked all processes. A coordinated algorithm was later presented that was nonblocking and forced only a subset of the processes to participate in a checkpointing event. That algorithm was shown to create inconsistencies in some situations, and new algorithms for taking consistent checkpoints were proposed. However, we found that these newer algorithms can still produce inconsistencies under behavior that is typical of a distributed environment, such as multiple forced checkpoints and multiple concurrent checkpoint initiations. In this paper, we identify the inconsistencies that can occur and present an efficient nonblocking algorithm that collects consistent global checkpoints and avoids some of the pitfalls in distributed nonblocking checkpointing.
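
To make the notion of a consistent global checkpoint concrete, the following minimal Python sketch uses the standard definition: a set of local checkpoints is inconsistent if some message is recorded as received while its send is not recorded (an orphan message). This illustrates the consistency property only; it is not the algorithm presented in the paper. The names `Message`, `orphan_messages`, and `is_consistent`, and the event-counter representation of a checkpoint, are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    # Hypothetical record of one message, for illustration only.
    sender: str
    receiver: str
    send_seq: int  # index of the send event in the sender's local history
    recv_seq: int  # index of the receive event in the receiver's local history


def orphan_messages(checkpoint, messages):
    """Return messages whose receive is recorded but whose send is not.

    `checkpoint` maps each process name to the number of local events its
    checkpoint covers (events 1..n are included). This representation is
    an assumption of the sketch.
    """
    return [
        m for m in messages
        if m.recv_seq <= checkpoint[m.receiver]
        and m.send_seq > checkpoint[m.sender]
    ]


def is_consistent(checkpoint, messages):
    # A global checkpoint is consistent when it contains no orphan message.
    return not orphan_messages(checkpoint, messages)


# P1 sends as its 3rd local event; P2 receives as its 2nd local event.
messages = [Message("P1", "P2", send_seq=3, recv_seq=2)]

good = {"P1": 3, "P2": 2}  # both the send and the receive are recorded
bad = {"P1": 2, "P2": 2}   # the receive is recorded, the send is not
```

Here `is_consistent(good, messages)` holds, while `bad` contains an orphan message. A forced checkpoint or a concurrent initiation that records a receive without the matching send would produce a state like `bad`.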
