Abstract

In distributed environments, message-logging-based checkpointing with rollback recovery is a commonly used approach for providing distributed systems with fault tolerance and synchronized global states. Taking checkpoints more frequently reduces recovery time in the presence of faults and hence improves system availability; however, it may also increase the probability that a task misses its deadline or prolong its execution time in fault-free scenarios. In distributed real-time computing, the system's overall quality must therefore be measured by a set of aggregated criteria, such as availability, task execution time, and deadline-miss probability. In this paper, we take state-synchronization costs into account in the checkpointing and rollback-recovery scheme and quantitatively analyze the relationships between the checkpoint interval and these criteria. Based on the analytical results, we present an algorithm for finding an optimal checkpoint interval that maximizes the system's overall quality.
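
The abstract does not reproduce the paper's analytical model or algorithm. Purely as an illustration of the tradeoff it describes, the sketch below numerically searches for a checkpoint interval that maximizes a hypothetical aggregated quality score, weighing availability against deadline-miss risk under an assumed exponential failure model. All parameter values, names, and the quality function are assumptions for illustration, not the authors' formulation.

```python
import math

# Illustrative sketch only: the paper's actual quality model and algorithm are
# not reproduced here. All cost parameters and the quality function below are
# hypothetical placeholders chosen to show how an optimal checkpoint interval
# could be searched for numerically.

CHECKPOINT_COST = 0.5      # time to take one checkpoint (incl. state sync), assumed
RECOVERY_COST = 2.0        # time to roll back and replay after a failure, assumed
MTBF = 500.0               # mean time between failures (exponential model), assumed
TASK_LENGTH = 100.0        # fault-free task execution time, assumed
DEADLINE = 115.0           # task deadline, assumed
W_AVAIL, W_DEADLINE = 0.5, 0.5   # weights of the aggregated quality criteria, assumed


def expected_execution_time(interval: float) -> float:
    """First-order estimate of expected task time for a given checkpoint interval."""
    n_checkpoints = TASK_LENGTH / interval
    overhead = n_checkpoints * CHECKPOINT_COST
    # Expected rework per failure is roughly half an interval plus recovery cost;
    # expected number of failures is total time over MTBF (first-order approximation).
    base = TASK_LENGTH + overhead
    expected_failures = base / MTBF
    rework = expected_failures * (interval / 2.0 + RECOVERY_COST)
    return base + rework


def quality(interval: float) -> float:
    """Hypothetical aggregated quality: availability minus a deadline-miss penalty."""
    t = expected_execution_time(interval)
    availability = TASK_LENGTH / t                    # useful-work fraction
    miss_penalty = max(0.0, (t - DEADLINE) / DEADLINE)
    return W_AVAIL * availability - W_DEADLINE * miss_penalty


def optimal_interval(lo: float = 0.5, hi: float = 50.0, steps: int = 1000) -> float:
    """Grid search for the interval that maximizes the quality score."""
    candidates = [lo + i * (hi - lo) / steps for i in range(steps + 1)]
    return max(candidates, key=quality)


if __name__ == "__main__":
    best = optimal_interval()
    print(f"best interval ~ {best:.2f}, quality = {quality(best):.4f}")
    # For comparison, Young's classic first-order approximation sqrt(2 * C * MTBF):
    print(f"Young's approximation ~ {math.sqrt(2 * CHECKPOINT_COST * MTBF):.2f}")
```

The grid search stands in for whatever closed-form or iterative procedure the paper derives; with a concrete quality model, the same structure would apply with the assumed functions replaced by the analytical expressions.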
