Abstract

Software fault tolerance is the task of detecting and recovering from failures that are not handled in the underlying hardware or operating system layers of an application. Software rejuvenation prevents failures by periodically, and gracefully, terminating an application and restarting it at a clean internal state. This paper describes five reusable software components that provide these capabilities. They perform automatic detection and restart of failed processes, checkpointing and recovery of data in memory, replication and synchronization of files, and software rejuvenation. These components, which have been ported to a number of UNIX platforms, can be used in any application with minimal programming effort. The fault tolerance capabilities of several communication products and services in AT&T have been enhanced by incorporating these components. Experience with these products to date indicates that the components provide efficient, economical means to increase the level of fault tolerance in an application.
