Abstract

This paper presents ChaRM-NT, a high-availability run-time system that provides Checkpoint-based Rollback recovery for parallel applications on clusters of computers (COCs) running Windows NT. ChaRM-NT implements an insert-mode, reduced coordinated checkpointing and rollback recovery (CRR) mechanism, which allows it to recover parallel applications from checkpoint files after system failures. In addition, we have implemented a new coordinated checkpointing algorithm that requires only O(n) control messages, where n is the number of participating processes. ChaRM-NT also implements a portable single-process CRR library that is independent of the message passing environment (MPE). It is therefore easy to adapt to different MPEs, and it currently supports PVM and MPI for NT.
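
To make the O(n) claim concrete, the following minimal sketch counts the control messages in one round of a generic coordinator-driven, two-phase coordinated checkpoint. It is an assumption for illustration only: the abstract does not describe ChaRM-NT's actual protocol, and the function name `count_control_messages` and the choice of process 0 as coordinator are hypothetical.

```python
# Illustrative only: a generic coordinator-driven two-phase checkpoint,
# NOT ChaRM-NT's actual algorithm (the abstract does not specify it).

def count_control_messages(n: int) -> int:
    """Count control messages for one checkpoint round among n processes."""
    if n < 1:
        raise ValueError("need at least one process")
    workers = n - 1      # process 0 acts as the (hypothetical) coordinator
    requests = workers   # coordinator -> worker: take a tentative checkpoint
    acks = workers       # worker -> coordinator: tentative checkpoint written
    commits = workers    # coordinator -> worker: commit the checkpoint
    return requests + acks + commits


if __name__ == "__main__":
    # The count grows linearly with n, i.e. 3(n - 1) messages per round.
    for n in (2, 4, 8, 16):
        print(n, count_control_messages(n))
```

Under these assumptions each round costs 3(n - 1) control messages, which is linear in n, in contrast to schemes in which every process exchanges markers with every other process and the count grows as O(n^2).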
