Abstract

The number of failures occurring in large-scale high performance computing (HPC) systems is increasing significantly because of the large number of physical components in such systems. Fault tolerance (FT) mechanisms help parallel applications mitigate the impact of failures. However, using such mechanisms incurs additional overhead. Failure prediction is therefore needed so that FT mechanisms can be applied only when they are likely to pay off, and the proficiency of a failure predictor determines how efficiently FT mechanisms are used. The proficiency of a failure predictor in HPC is usually expressed through well-known error measurements, e.g. MSE, MAD, precision and recall, where lower error implies greater proficiency. In this manuscript, we propose to view prediction proficiency from another aspect: lost computing time. We then discuss why error measurements are insufficient as HPC failure prediction proficiency metrics from the aspect of lost computing time, and propose novel metrics that address these issues.

