Abstract

Modern systems are prone to errors, which calls for fault tolerance mechanisms. Traditional fault tolerance mechanisms such as checkpointing introduce large, sometimes unacceptable, overhead. This paper introduces parallel checkpoint, a highly efficient checkpoint mechanism for fault tolerance in multi-threaded programs. By eliminating the global barrier, parallelizing the threads' checkpoint phases, and overlapping the threads' computing and checkpoint phases, we achieve a large performance gain (3.16x on average) and much better scalability than previous checkpoint mechanisms.
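The three ideas named in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes the threads share no state, so no cross-thread consistency protocol is needed, and it uses an in-memory dictionary as a stand-in for stable storage. All names (`run_workers`, `pending`, `saved`) are hypothetical. Each worker snapshots its own state without waiting at a global barrier, and a background writer persists the snapshots while the workers keep computing.

```python
import copy
import queue
import threading


def run_workers(num_threads=4, steps=6, interval=2):
    """Run workers that checkpoint independently, with no global barrier."""
    saved = {}                   # stand-in for stable storage: tid -> (step, state)
    saved_lock = threading.Lock()
    pending = queue.Queue()      # snapshots waiting to be persisted

    def writer():
        # Persists snapshots in the background, overlapping with computation.
        while True:
            item = pending.get()
            if item is None:     # sentinel: all workers have finished
                break
            tid, step, snapshot = item
            with saved_lock:
                saved[tid] = (step, snapshot)

    def worker(tid):
        state = {"acc": 0}       # private state (assumption: nothing is shared)
        for step in range(1, steps + 1):
            state["acc"] += tid * step            # compute phase
            if step % interval == 0:
                # Local checkpoint: copy the private state, hand it off,
                # and continue computing without waiting for other threads.
                pending.put((tid, step, copy.deepcopy(state)))

    writer_thread = threading.Thread(target=writer)
    writer_thread.start()
    workers = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    pending.put(None)            # tell the writer to stop
    writer_thread.join()
    return saved


if __name__ == "__main__":
    for tid, (step, state) in sorted(run_workers().items()):
        print(tid, step, state)
```

The sketch shows the structure only, not the reported speedup: a real mechanism would also have to handle dependencies between threads and write to durable storage.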
