Evaluation of Level of Confidence and Optimization of Roll-back Recovery with Checkpointing for Real-Time Systems

Dimitar Nikolov, Urban Ingelsson, Virendra Singh, Erik Larsson

doi:10.1016/j.microrel.2014.02.004

Abstract

Increasing soft error rates in semiconductor devices manufactured in more recent technologies necessitate the use of fault-tolerant techniques such as Roll-back Recovery with Checkpointing (RRC). Because RRC introduces time overhead that increases the completion (execution) time, time constraints (deadlines) might be violated. This is a drawback for a class of computer systems in which correct operation is defined not only by producing the correct outcome of an operation but also by ensuring that deadlines are met. These computer systems are referred to as real-time systems (RTSs). In general, RTSs are classified as soft or hard depending on the consequences of violating the deadlines. For soft RTSs, where the consequences of violating the deadlines are not very severe, research has focused on optimizing RRC and has shown that it is possible to find the optimal number of checkpoints such that the average execution time (AET) is minimal. While minimal AET is important for soft RTSs, for hard RTSs, where the consequences of violating the deadlines may be catastrophic, it is more important to provide a high probability that deadlines are met. Hence, there is a need for probabilistic guarantees that jobs employing RRC complete before a given deadline. Traditionally, AET analysis has been used for soft RTSs, and worst-case execution time (WCET) analysis along with schedule feasibility has been used for hard RTSs. In this paper we introduce a reliability metric, Level of Confidence (LoC), which is equally applicable to both soft and hard RTSs. LoC is used as a metric to evaluate to what extent a deadline is met. The main contributions of this paper are as follows. First, we present a mathematical framework for the evaluation of LoC when RRC is employed. Second, we provide a proof to verify the correctness of the proposed expression. Third, in the context of hard RTSs, we provide a method to obtain the optimal number of checkpoints that maximizes the LoC. Fourth, in the context of soft RTSs, where the maximal LoC may not be needed but some LoC requirement must still be satisfied, we present an optimization method for RRC that finds the number of checkpoints that results in the minimal completion time, subject to that completion time satisfying the given LoC requirement. Fifth, we use the proposed framework to evaluate and compare probabilistic guarantees when RRC is optimized towards soft RTSs.
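
To make the idea of a probabilistic deadline guarantee concrete, the following is a minimal sketch under a simplified model that is an assumption of this example and not the expression derived in the paper. The job of error-free length T is split into n_c equal execution segments, each attempt at a segment costs T/n_c plus a checkpointing overhead tau, each attempt succeeds independently with probability q_job^(1/n_c), and a failed segment is re-executed from the last checkpoint. The LoC with respect to a deadline D is then taken to be the probability that all n_c segments succeed within the number of attempts that fit before D. All function and parameter names are hypothetical.

```python
from math import comb


def level_of_confidence(T, n_c, tau, q_job, D):
    """Probability of finishing by deadline D under the simplified model.

    Assumptions (illustrative only, not taken from the paper):
      T     : error-free processing time of the whole job
      n_c   : number of checkpoints (equal-length execution segments)
      tau   : time overhead added to every segment attempt
      q_job : probability that T time units execute without error
      D     : deadline
    """
    seg = T / n_c + tau                  # cost of one segment attempt
    q = q_job ** (1.0 / n_c)             # success probability of one attempt
    max_attempts = int(D / seg + 1e-9)   # attempts that fit before D
    k_max = max_attempts - n_c           # re-executions we can afford
    if k_max < 0:
        return 0.0                       # even an error-free run misses D
    # Negative binomial: exactly k failed attempts before the n_c-th success.
    return sum(
        comb(n_c + k - 1, k) * q ** n_c * (1.0 - q) ** k
        for k in range(k_max + 1)
    )


def best_checkpoints(T, tau, q_job, D, n_max):
    """Exhaustively pick the n_c in 1..n_max with the highest LoC."""
    return max(
        range(1, n_max + 1),
        key=lambda n: level_of_confidence(T, n, tau, q_job, D),
    )


if __name__ == "__main__":
    n = best_checkpoints(T=1000.0, tau=5.0, q_job=0.9, D=1300.0, n_max=50)
    print(n, level_of_confidence(1000.0, n, 5.0, 0.9, 1300.0))
```

The exhaustive search is used only for clarity; it illustrates the trade-off that more checkpoints shorten each re-execution but add overhead to every segment, so the LoC for a fixed deadline is generally not monotonic in the number of checkpoints.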

Full Text