Abstract

APEmille is the third generation of the APE family of supercomputers, but the first one to include built-in support for fault tolerance. The reasons that led to the consideration of reliability issues come in a large part from the experiences with the previous generations. The APE supercomputers are number-crunching engines aimed at solving complex scientific problems. The typical duration of a computation ranges from several days to several weeks. Due to the large number of components built into an APEmille machine, the expected MTTF without fault tolerance may fall into the same range. Thus, a long-running computation could be invalidated by a system failure with high probability, and there would be no guarantee that the results of a successfully terminated program are correct. To improve on this situation the designers of APEmille decided to incorporate fault tolerance into the system. They chose the error removal based approach, composed of error detection, system-level fault diagnosis, system repair, and backward error recovery. This paper presents a failure model of the APEmille components and develops a comprehensive recovery scheme for the whole computer, including even the reliable storage that is used to record the recovery information.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.