Application MTTFE vs. Platform MTBF: A Fresh Perspective on System Reliability and Application Throughput for Computations at Scale

J T Daly, S E Michalak, L A Pritchett-Sheats

doi:10.1109/ccgrid.2008.103

Abstract

When applications run on HPC systems whose component failure rates are high enough to impact productivity, it becomes important to consider the effect of those failures on individual applications. Typically, this is done by assuming that the mean time between failures (MTBF) for hardware and software components on the system is equivalent to the mean time to fatal error (MTTFE) for an application running on that system. In addition, one commonly applies the rule-of-thumb estimate that application MTTFE scales as the inverse of the number of nodes used to run the application, so that running on half as many nodes increases MTTFE by a factor of two. However, this estimate does not account for the fact that a non-trivial fraction of failures affect multiple compute nodes, so a single component failure has the potential to cause multiple application fatal errors. In the work that follows, a new model for application MTTFE is derived based on the impact of multi-component failures and their potential to terminate multiple applications.
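
The rule of thumb mentioned in the abstract can be illustrated with a minimal sketch. It assumes that failures are independent and that each failure affects exactly one node, which is the simplification the abstract calls into question. The function name and the `node_mtbf_hours` parameter are hypothetical and chosen only for illustration; this is not the new model derived in the paper.

```python
def rule_of_thumb_mttfe(node_mtbf_hours: float, nodes: int) -> float:
    """Estimate application MTTFE under the inverse-scaling rule of thumb.

    Assumption (not from the paper): every failure is independent and
    takes down exactly one node, so an application on `nodes` nodes
    sees failures `nodes` times as often as a single node does.
    """
    if nodes <= 0:
        raise ValueError("nodes must be a positive integer")
    return node_mtbf_hours / nodes


# Hypothetical per-node MTBF, for illustration only.
full_job = rule_of_thumb_mttfe(node_mtbf_hours=10_000.0, nodes=1_000)
half_job = rule_of_thumb_mttfe(node_mtbf_hours=10_000.0, nodes=500)

# Halving the node count doubles the estimated MTTFE.
print(half_job / full_job)
```

Because this estimate treats every failure as confined to a single node, it cannot capture the case where one component failure terminates several applications at once.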


