Abstract

A current trend in distributed computing systems is to build fault tolerance into applications at low cost, in terms of both run-time performance and the programming effort required to construct reliable application software. To meet this objective, we have designed and implemented Libra, a software library that runs on a network of workstations and supports reliable distributed computing efficiently. By providing fault-tolerance transparency and a simple, easy-to-use, high-level message-passing interface, Libra simplifies the development of reliable distributed applications. Fault tolerance is based on distributed consistent checkpointing and rollback recovery, integrated with a user-level network communication protocol. To address real-life problems, the library tolerates message losses and efficiently supports checkpointing and recovery of user files. Performance evaluations show that Libra imposes low run-time overhead and minimizes the communication overhead of taking a consistent distributed checkpoint and catching messages in transit during checkpointing.
