Abstract

Coordinated checkpointing has low stable storage requirements and simplifies the recovery process by reserving a set of consistent global checkpoints. Unfortunately, most of the algorithms originally proposed either incurred a high communication overhead or blocked all processes. A later coordinated algorithm was nonblocking and forced only a subset of the processes to participate in a checkpointing event. That algorithm was shown to create inconsistencies in some situations, and new algorithms for taking consistent checkpoints were proposed. However, we found that these algorithms can still produce inconsistencies under behavior that is typical of a distributed environment, such as multiple forced checkpoints and multiple concurrent checkpoint initiations. In this paper, we identify the inconsistencies that can occur and present an efficient nonblocking algorithm that collects consistent global checkpoints and avoids some of the pitfalls of distributed nonblocking checkpointing.
