Abstract

A classical approach to fault tolerance in distributed systems is to incorporate efficient, fault-tolerant procedures for checkpointing and recovery. We propose two checkpointing procedures, each of which can be initiated by any process in the system or triggered by the failure of one or more component processes. The procedures return the most recent consistent checkpoints for the initiating processes and do not interfere with the progress of the distributed application. Furthermore, they guarantee that a consistent checkpoint has been obtained when they terminate. Examples illustrating the application of the procedures are also provided.
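
To make the notion of a consistent checkpoint concrete, the following minimal sketch checks a set of per-process checkpoints against the commonly used criterion that no message is recorded as received unless it is also recorded as sent. This illustrates the property only, not the two procedures proposed here; the `Checkpoint` type, its fields, and the message identifiers are hypothetical names introduced for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    """Hypothetical per-process checkpoint: the message IDs the process
    had sent and received at the moment its state was saved."""
    process: str
    sent: frozenset = frozenset()
    received: frozenset = frozenset()


def orphan_messages(checkpoints):
    """Return IDs of messages recorded as received in some checkpoint
    but not recorded as sent in any checkpoint of the set."""
    all_sent = set()
    all_received = set()
    for cp in checkpoints:
        all_sent |= cp.sent
        all_received |= cp.received
    return all_received - all_sent


def is_consistent(checkpoints):
    """A set of checkpoints is treated as consistent here when it
    contains no orphan messages (an assumed, standard criterion)."""
    return not orphan_messages(checkpoints)
```

For instance, if process P checkpoints after sending `m1` and process Q checkpoints after receiving `m1`, the pair passes the check; if Q's checkpoint also records a message `m2` that P's checkpoint does not record as sent, the pair fails, and one of the processes would need an earlier or later checkpoint.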
