Abstract

To keep data storage systems reliable, impending failures of hard disk drives (HDDs) must be predicted accurately and in a timely manner so that data loss is prevented and recovery cost is reduced. Over the past decades, many supervised machine learning methods have been proposed for HDD failure prediction that take SMART (Self-Monitoring, Analysis and Reporting Technology) attributes as input. However, these methods have been evaluated on different datasets or with different preprocessing treatments, so a comparative analysis is lacking. To fill this gap, this paper provides a systematic study of three key steps of failure prediction: feature selection strategies, data preprocessing treatments, and classification models. A feature selection strategy is proposed that tests the significance of the difference between healthy and failed samples. Data relabeling, together with several other data preprocessing treatments, is applied and shown to be effective in the case study. The performance of seven classification models is compared, among which the Random Forest model performs best, with a 53.95% failure detection rate (FDR) and a 6.0% false alarm rate (FAR). Moreover, the Gini importance of the SMART attributes is calculated, and two attributes, SMART 197 and SMART 187, are found to be closely related to HDD failures.
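
The two reported metrics can be illustrated with a minimal sketch. It assumes the definitions commonly used in HDD failure prediction work, which the abstract itself does not spell out: FDR is the fraction of failed drives that are correctly flagged, and FAR is the fraction of healthy drives that are wrongly flagged. The function name `fdr_far` and the toy labels are hypothetical and are not taken from the paper.

```python
def fdr_far(y_true, y_pred):
    """Return (FDR, FAR) for binary labels where 1 = failed and 0 = healthy.

    Assumed definitions (not stated in the abstract):
      FDR = TP / (TP + FN)  -- share of failed drives that were detected
      FAR = FP / (FP + TN)  -- share of healthy drives that were flagged
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    # Guard against division by zero when a class is absent.
    fdr = tp / (tp + fn) if (tp + fn) else 0.0
    far = fp / (fp + tn) if (fp + tn) else 0.0
    return fdr, far


# Hypothetical toy data: 4 failed drives and 6 healthy drives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
fdr, far = fdr_far(y_true, y_pred)
print(f"FDR = {fdr:.2%}, FAR = {far:.2%}")
```

Under these definitions, a classifier is better the higher its FDR and the lower its FAR, which is why the two numbers are reported together.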
