Dynamic cluster strategy for hierarchical rollback‐recovery protocols in MPI HPC applications

Xiaofei Liao, Binsheng Zhang, Yu Zhang, Yi Lin, Long Zheng, Xuanhua Shi, Hai Jin

doi:10.1002/cpe.4173

Abstract

Fault tolerance in parallel computing is becoming increasingly important as high‐performance computing systems grow in number and scale. Coordinated checkpointing and message logging protocols are commonly used fault tolerance mechanisms for message‐passing applications. However, each of these mechanisms has severe drawbacks that make it insufficient when used alone. Hierarchical rollback‐recovery protocols, which combine coordinated checkpointing with message logging, are a better solution. However, such protocols may not achieve adequate efficiency because the communication pattern of an application may vary between stages at runtime. To improve the efficiency of hierarchical rollback‐recovery protocols, we propose a dynamic cluster strategy that uses a prediction scheme to adapt to runtime variation in the communication pattern. Finally, the efficiency and scalability of the dynamic cluster strategy are evaluated using two static process partition algorithms on the High‐Performance Linpack benchmark.

Full Text