This paper discusses distributed checkpointing with logging for practical applications running with limited resources. We present a discrete-time model that evaluates the total expected overhead per event when the number of checkpoints each process can hold is finite. In many actual applications, the rollback distance is also bounded to a finite interval. We therefore describe the recovery overhead of the checkpointing scheme using a truncated geometric distribution for the rollback distance. Although the optimal checkpoint interval, which minimizes the total expected overhead, is difficult to derive analytically under this distribution, substituting simpler probability distributions for the truncated geometric distribution allows it to be derived explicitly. Numerical examples obtained through simulation show that the new models and analyses achieve a near-minimal total overhead.
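To make the rollback distance model concrete, the following is a minimal sketch of a truncated geometric distribution, written under stated assumptions rather than taken from the paper. The function names, the parameter `p` (per-step probability that the rollback stops), the bound `n` (maximum rollback distance), and the choice of support 1..n are all hypothetical; the paper's own parameterization may differ.

```python
def truncated_geometric_pmf(p, n):
    """Return [P(X=1), ..., P(X=n)] for a geometric distribution truncated to 1..n.

    Assumption: the untruncated form is P(X=k) = p * (1 - p)**(k - 1), and
    truncation renormalizes the first n terms so that they sum to 1.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    # Probability mass that the untruncated distribution places on 1..n.
    norm = 1.0 - (1.0 - p) ** n
    return [p * (1.0 - p) ** (k - 1) / norm for k in range(1, n + 1)]


def expected_rollback_distance(p, n):
    """Mean of the truncated geometric distribution on 1..n."""
    pmf = truncated_geometric_pmf(p, n)
    return sum(k * q for k, q in enumerate(pmf, start=1))
```

Under these assumptions, the expected rollback distance can never exceed `n`, which is what keeps the recovery overhead finite when each process holds only a limited number of checkpoints.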