Abstract

Large-scale parallel applications on modern multi-core and many-core systems often run for longer than the mean time between failures of the machines that execute them. These applications must therefore tolerate hardware failures so that a machine failure does not discard all the computation completed so far. Checkpointing and rollback recovery are widely used techniques for building fault-tolerant applications. In a parallel application, a checkpointing protocol is needed to guarantee that the individual checkpoints together form a consistent global state. Coordinated approaches are the most popular way to achieve this consistency, but the runtime coordination they require limits their scalability. This work presents a new hybrid protocol that combines compile-time detection of valid recovery lines with a lightweight, asynchronous runtime protocol that negotiates the closest valid recovery line. Experimental results show the efficiency and scalability of the proposal.

