Abstract

Fault tolerance in parallel computing is becoming increasingly important as high‐performance computing systems proliferate. Coordinated checkpointing and message logging are commonly used fault tolerance mechanisms for message‐passing applications, but on their own they are insufficient because of severe drawbacks. Hierarchical rollback‐recovery protocols, which combine coordinated checkpointing with message logging, are a better solution. However, such protocols may fall short of their potential efficiency because the communication pattern of an application can vary between stages at runtime. To improve the efficiency of hierarchical rollback‐recovery protocols, we propose a dynamic cluster strategy that uses a prediction scheme to adapt to runtime variation in the communication pattern. We evaluate the efficiency and scalability of the dynamic cluster strategy using 2 static process partition algorithms on the High‐Performance Linpack benchmark.

