Micro-checkpointing in fault tolerant runtimes

Polyvios Pratikakis, Pavlos Katsogridakis

doi:10.1145/2597917.2597926

Abstract

Multicore processors are increasingly used in safety-critical applications. On one hand, their increasing chip density causes these processors to be more susceptible to transient faults; on the other hand, the existence of many cores offers a straightforward compartmentalization against permanent hardware faults. To tackle the first issue and take advantage of the second, we present FT-BDDT, a fault-tolerant task-parallel runtime system. FT-BDDT extends the BDDT runtime system that implements the OMP-Ss dataflow programming model for spawning and scheduling parallel tasks, in which, similarly to OpenMP 4.0, a dynamic dependence analysis detects conflicting tasks and automatically synchronizes them to avoid data races and non-determinism. FT-BDDT recovers from both transient and permanent faults. A transient fault during task execution is handled by simply re-running the task. To handle transient faults in the runtime system itself, FT-BDDT uses fine-grain micro-checkpointing of the runtime state, so that recovery is always possible by re-running a single basic block of code on error. Permanent faults are treated in a similar fashion, by having the master core steal the task checkpoint or the runtime micro-checkpoint and reschedule the task or recover the runtime state, respectively. We evaluate FT-BDDT on several benchmarks under various error conditions, while guiding errors to attain maximum coverage of the runtime code. We find a 9.5% average runtime overhead for checkpointing, a constant small space overhead, and a negligible recovery time per error.
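The checkpoint-and-rerun idea described above can be illustrated with a minimal sketch. It is not FT-BDDT's implementation: the `TransientFault` exception, the `run_with_checkpoint` helper, and the dictionary-based runtime state are all hypothetical names chosen for illustration, and fault detection is simulated by raising an exception. The sketch only shows the general pattern of saving a small piece of state before a short step, rolling back on error, and re-running that step.

```python
import copy


class TransientFault(Exception):
    """Hypothetical stand-in for a detected transient hardware fault."""


def run_with_checkpoint(step, state, max_retries=3):
    """Run `step` on `state`; on a fault, restore the checkpoint and rerun.

    Returns the number of reruns that were needed.
    """
    checkpoint = copy.deepcopy(state)  # micro-checkpoint taken before the step
    for attempt in range(max_retries + 1):
        try:
            step(state)  # may partially mutate state before faulting
            return attempt
        except TransientFault:
            # Roll back any partial update, then retry the same step.
            state.clear()
            state.update(copy.deepcopy(checkpoint))
    raise RuntimeError("fault persisted beyond the retry budget")


def make_faulty_enqueue(task_id, faults):
    """Build a runtime step that faults `faults` times mid-update (simulated)."""
    remaining = {"n": faults}

    def enqueue(state):
        state["ready_queue"].append(task_id)  # first half of the update
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise TransientFault()  # simulated fault between the two halves
        state["pending"] -= 1  # second half of the update

    return enqueue


# Assumed toy runtime state: a ready queue and a count of pending tasks.
state = {"ready_queue": [], "pending": 1}
reruns = run_with_checkpoint(make_faulty_enqueue("t0", faults=1), state)
```

In this toy run the step faults once after enqueuing the task but before decrementing the counter; the rollback discards the half-applied update, so the rerun leaves the task enqueued exactly once. Keeping each checkpointed step small is what bounds both the saved state and the work repeated on recovery.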

Full Text