Proficiency Metrics for Failure Prediction in High Performance Computing

Narate Taerat, Nichamon Naksinehaboon, Clayton Chandler, Chokchai Leangsuksun

doi:10.1109/ispa.2010.84

Abstract

The number of failures in large-scale high performance computing (HPC) systems is increasing significantly because of the large number of physical components in such systems. Fault tolerance (FT) mechanisms help parallel applications mitigate the impact of failures, but they incur additional overhead. Failure prediction is therefore needed to use FT mechanisms judiciously, and the proficiency of a failure predictor determines how efficiently those mechanisms are used. The proficiency of a failure predictor in HPC is usually expressed with well-known error measurements, e.g., MSE, MAD, precision, and recall, where lower error implies greater proficiency. In this manuscript, we propose to view prediction proficiency from another aspect: lost computing time. We then discuss why error measurements are insufficient as HPC failure prediction proficiency metrics from this aspect, and propose novel metrics that address these issues.

Full Text