Abstract
This paper presents ChaRM-NT, a high-availability run-time system that provides checkpoint-based rollback recovery for parallel applications on a cluster of computers (COCs) running Windows NT. ChaRM-NT implements an insert-mode, reduced coordinated checkpointing and rollback recovery (CRR) mechanism. With these techniques, ChaRM-NT can recover parallel applications from their checkpoint files after a system failure. In addition, we have implemented a new coordinated checkpointing algorithm that requires only O(n) control messages, where n is the number of participating processes. ChaRM-NT also implements a portable single-process CRR library that is independent of the message passing environment (MPE). It is therefore easy to adapt to different MPEs, and it currently supports PVM and MPI for NT.
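To make the O(n) message bound concrete, the following is a minimal sketch of a generic coordinator-driven checkpoint round, simulated in a single Python process. It is not the ChaRM-NT algorithm, which the abstract does not describe in detail; the names `Process`, `coordinated_checkpoint`, and `rollback` are hypothetical, and the request/acknowledge/commit structure is an assumption chosen only because it is a common way to reach a linear message count.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Process:
    """Hypothetical stand-in for one process of a parallel application."""
    rank: int
    state: int = 0                    # application state to be saved
    tentative: Optional[int] = None   # checkpoint taken but not yet committed
    committed: Optional[int] = None   # last globally consistent checkpoint


def coordinated_checkpoint(processes: List[Process]) -> int:
    """Run one checkpoint round and return the number of control messages.

    Assumption: processes[0] acts as coordinator and exchanges
    REQUEST, ACK, and COMMIT messages with each of the other n - 1 processes.
    """
    if not processes:
        return 0
    coordinator, others = processes[0], processes[1:]
    messages = 0

    # Phase 1: every process takes a tentative checkpoint.
    coordinator.tentative = coordinator.state
    for p in others:
        messages += 1            # REQUEST: coordinator -> p
        p.tentative = p.state
        messages += 1            # ACK: p -> coordinator

    # Phase 2: once all ACKs have arrived, the checkpoints are committed.
    coordinator.committed = coordinator.tentative
    for p in others:
        messages += 1            # COMMIT: coordinator -> p
        p.committed = p.tentative

    return messages


def rollback(processes: List[Process]) -> None:
    """Restore every process to its last committed checkpoint, if any."""
    for p in processes:
        if p.committed is not None:
            p.state = p.committed
```

Under these assumptions a round with n processes costs 3(n - 1) control messages, so 8 processes exchange 21 messages, and the count grows linearly with n. Calling `rollback` after a simulated failure returns every process to the state saved in the last committed round.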