Abstract

On HPC systems whose component failure rates are high enough to reduce productivity, it becomes important to consider how those failures affect individual applications. Typically, this is done by assuming that the mean time between failures (MTBF) of the hardware and software components on the system is equivalent to the mean time to fatal error (MTTFE) of an application running on that system. In addition, one commonly applies the rule-of-thumb estimate that application MTTFE scales as the inverse of the number of nodes used to run the application, so that running on half as many nodes increases MTTFE by a factor of two. However, this estimate ignores the fact that a non-trivial fraction of failures affect multiple compute nodes, so a single component failure can cause multiple application fatal errors. In the work that follows, a new model for application MTTFE is derived that accounts for multi-component failures and their potential to terminate multiple applications.
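As a point of reference, the conventional rule of thumb described above can be written as a minimal sketch. This illustrates only the baseline inverse-node-count estimate, not the new model derived in this work. The function name `rule_of_thumb_mttfe` and the numbers used (a 1000-node system with a 24-hour MTBF) are hypothetical and chosen purely for illustration.

```python
def rule_of_thumb_mttfe(system_mtbf_hours: float,
                        system_nodes: int,
                        app_nodes: int) -> float:
    """Baseline estimate: application MTTFE scales as 1 / (nodes used).

    Assumes (as the conventional approach does) that an application
    spanning the whole system has MTTFE equal to the system MTBF, and
    that every failure affects exactly one node.
    """
    if not 0 < app_nodes <= system_nodes:
        raise ValueError("app_nodes must be between 1 and system_nodes")
    return system_mtbf_hours * system_nodes / app_nodes


# Hypothetical example values, not taken from any measured system.
full_system = rule_of_thumb_mttfe(24.0, 1000, 1000)
half_system = rule_of_thumb_mttfe(24.0, 1000, 500)
print(full_system, half_system)
```

Under these assumptions, halving the node count doubles the estimated MTTFE. The single-node-per-failure assumption in the docstring is exactly the one that multi-node failures violate.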
