Abstract

High Performance Computing (HPC) has seen prodigious growth over the past few years and fault tolerance solutions have been incorporated into HPC systems. The most commonly used technique for fault tolerance in HPC is checkpointing. Processes achieve fault tolerance by saving local checkpoint periodically during execution. When a failure occurs, the previously saved global checkpoint can be used to restart the computation from an intermediate state. Quasi-Synchronous Checkpointing (QSC) is atractive because it allows finding consistent global checkpoints without an extra message overhead. QSC algorithms are classified into: Strictly Z-Path Free, Z-Path Free and Z-Cycle Free. Z-paths and Z-cycles are undesirable patterns that can give rise to inconsistent system states. QSC algorithms are often evaluated with regard to performance, generally through simulation. However, few works have been designed to validate their correctnes. Our approach validates if a QSC algorithm is correct by modeling its execution into a graph and then verifying over this graph if the algorithm is exempt from undesirable patterns. QSC is based on the Happened-Before Relation (HBR) introduced by Lamport. One main problem linked to the HBR is the combinatorial state explosion. Nevertheless, for a HPC system modeled with a HBR graph, the computational cost for the identification of such patterns becomes prohibitively high. In this paper, we define a set of transformation rules oriented towards the detection of the undesirable patterns over a graph derived from the causal order set abstraction (CAOS) which is equivalent to a HBR graph, but it drastically reduces the statespace of a system.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.