Strategies for storage of checkpointing data using non-dedicated repositories on Grid systems

Raphael Y De Camargo, Renato Cerqueira, Fabio Kon

doi:10.1145/1101499.1101500


Open Access

https://doi.org/10.1145/1101499.1101500


Abstract

Dealing with the large amounts of data generated by long-running parallel applications is one of the most challenging aspects of Grid Computing. Periodic checkpoints may be taken to guarantee application progress, producing even more data. The classical approach is to employ high-throughput checkpoint servers connected to the computational nodes by high-speed networks. In Opportunistic Grid Computing, we do not want to be forced to rely on such dedicated hardware. Instead, we want to use the shared Grid nodes to store application data in a distributed fashion. In this work, we evaluate several strategies for storing checkpoints on distributed non-dedicated repositories. We consider the tradeoff among the computational overhead, storage overhead, and degree of fault tolerance of these strategies. We compare the use of replication, parity information, and information dispersal (IDA). We used InteGrade, an object-oriented Grid middleware, to implement the storage strategies and perform evaluation experiments.
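To make the parity strategy concrete, the following is a minimal Python sketch, not the InteGrade implementation. It assumes a checkpoint is an in-memory byte string, split into m equal-size fragments plus one XOR parity fragment, with each fragment stored on a different node. The function names (split_with_parity, recover) are hypothetical and chosen only for illustration.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def split_with_parity(data: bytes, m: int) -> list:
    """Split data into m zero-padded fragments and append one XOR parity fragment.

    Hypothetical helper: each of the m + 1 returned fragments would be
    sent to a different repository node.
    """
    if m < 1:
        raise ValueError("m must be at least 1")
    frag_len = -(-len(data) // m)  # ceiling division
    padded = data.ljust(frag_len * m, b"\0")
    fragments = [padded[i * frag_len:(i + 1) * frag_len] for i in range(m)]
    parity = reduce(xor_bytes, fragments)
    return fragments + [parity]


def recover(fragments: list, original_len: int) -> bytes:
    """Rebuild the checkpoint when at most one fragment is missing (None)."""
    missing = [i for i, f in enumerate(fragments) if f is None]
    if len(missing) > 1:
        raise ValueError("XOR parity tolerates only one lost fragment")
    fragments = list(fragments)
    if missing:
        # The XOR of all surviving fragments equals the lost one.
        present = [f for f in fragments if f is not None]
        fragments[missing[0]] = reduce(xor_bytes, present)
    # Drop the parity fragment and the zero padding.
    return b"".join(fragments[:-1])[:original_len]
```

In this sketch the stored data amounts to (m + 1)/m times the checkpoint size (ignoring padding) and survives the loss of any single fragment. Keeping r full replicas instead stores r times the checkpoint size and survives the loss of r - 1 copies, which illustrates the storage versus fault-tolerance tradeoff that the abstract refers to.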
