Abstract
Checkpointing and rollback recovery is an effective fault-tolerance technique that has been widely used in parallel and distributed computer systems. We present a nonblocking coordinated checkpointing algorithm for distributed systems that differs from the conventional approach, in which processes first take temporary checkpoints and then convert them to permanent ones. The proposed algorithm allows processes to take permanent checkpoints directly, without taking temporary checkpoints, which contributes to its speed of execution. Orphan messages are eliminated by the sender processes, and in-transit messages are eliminated by the checkpointing interval and a retransmission mechanism. The algorithm reduces the control-message complexity of taking a checkpoint from O(n^2) to O(n), requiring only n-1 control messages.