Summary

Self-Monitoring, Analysis, and Reporting Technology (SMART) is a technology built into hard disk drives to predict impending disk failures so that data can be protected in advance. Because the prediction accuracy of SMART alone is unsatisfactory, machine learning techniques have recently been explored to improve it. These approaches treat disk failure prediction as a binary classification problem with SMART attributes as features, and some of them achieve satisfactory prediction accuracy. However, although the primary objective of these methods is to reduce disk failure recovery costs, there is no uniform metric for measuring their financial impact. In this article, we propose a simple yet practical metric, Mean-Cost-To-Recovery (MCTR), for disk failure prediction in data centers from a financial impact perspective. Specifically, by assigning different weights to mispredicted healthy disks and mispredicted failed disks, we measure the overall misprediction cost, that is, MCTR. In addition, we argue that the commonly used classification threshold of 0.5 is suboptimal for disk failure prediction because the data are imbalanced: failed disks are far fewer than healthy ones. To find the optimal threshold, the one that yields the minimal MCTR, we wrap a cost-minimizing procedure around disk failure prediction and search for it with a threshold-moving technique. Moreover, to map sample-level prediction results to disk-level prediction results, we design a modified leaky-bucket algorithm that determines the health state of a disk from its multiple sample-level prediction results. To evaluate the effectiveness of our approach, we conduct extensive experiments on three real-world datasets. The experimental results show that our approach reduces MCTR by up to 86.9% compared with reactive data protection schemes and by up to 22.3% compared with cost-blind failure prediction.
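The two mechanisms named above, threshold moving under asymmetric costs and leaky-bucket aggregation of per-sample predictions, can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration: the abstract does not give the MCTR formula, so `mean_cost` simply averages a per-disk penalty (`cost_fp` for a healthy disk flagged as failing, `cost_fn` for a missed failure), and `leaky_bucket_failed` is a generic leaky bucket with hypothetical `fill`, `leak`, and `capacity` parameters, not the modified variant the article designs.

```python
from typing import Sequence


def mean_cost(scores: Sequence[float], labels: Sequence[int],
              threshold: float, cost_fp: float, cost_fn: float) -> float:
    """Average misprediction cost per disk (assumed stand-in for MCTR).

    labels: 1 = failed disk, 0 = healthy disk.
    A disk is predicted as failing when its score >= threshold.
    """
    total = 0.0
    for score, label in zip(scores, labels):
        predicted_failed = score >= threshold
        if predicted_failed and label == 0:
            total += cost_fp  # healthy disk wrongly flagged
        elif not predicted_failed and label == 1:
            total += cost_fn  # failing disk missed
    return total / len(labels)


def best_threshold(scores: Sequence[float], labels: Sequence[int],
                   cost_fp: float, cost_fn: float, steps: int = 100) -> float:
    """Threshold moving: grid-search [0, 1] for the lowest mean cost."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates,
               key=lambda t: mean_cost(scores, labels, t, cost_fp, cost_fn))


def leaky_bucket_failed(sample_preds: Sequence[int], fill: int = 2,
                        leak: int = 1, capacity: int = 4) -> bool:
    """Generic leaky bucket over one disk's sample-level predictions.

    Each positive sample adds `fill`; each negative sample drains `leak`.
    The disk is declared failing once the level reaches `capacity`.
    """
    level = 0
    for pred in sample_preds:
        level = level + fill if pred else max(0, level - leak)
        if level >= capacity:
            return True
    return False


# Toy data: four healthy disks and two failed ones (imbalanced on purpose).
scores = [0.05, 0.10, 0.20, 0.40, 0.35, 0.60]
labels = [0, 0, 0, 0, 1, 1]
t = best_threshold(scores, labels, cost_fp=1.0, cost_fn=10.0)
print(t, mean_cost(scores, labels, t, 1.0, 10.0),
      mean_cost(scores, labels, 0.5, 1.0, 10.0))
```

On this toy data a missed failure is assumed to cost ten times as much as a false alarm, so the search settles on a threshold below 0.5: it accepts one false alarm in order to catch the failed disk that scores 0.35. The leaky bucket plays a separate role, suppressing isolated positive samples so that a single noisy prediction does not mark a whole disk as failing.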