Abstract

Reliable data storage depends on predicting impending failures of hard disk drives (HDDs) accurately and in time to prevent data loss and reduce recovery cost. Over the past decades, many supervised machine learning methods have been proposed for HDD failure prediction that take SMART (Self-Monitoring, Analysis and Reporting Technology) attributes as input. However, these methods have been evaluated on different datasets or with different preprocessing treatments, so they are difficult to compare directly. To fill this gap, this paper provides a systematic study of three key steps of failure prediction: feature selection strategies, data preprocessing treatments, and classification models. We propose a feature selection strategy that tests the significance of the difference between healthy and failed samples. Data relabeling, together with other data preprocessing treatments, is applied and shown to be effective in the case study. The performance of seven classification models is compared; the Random Forest model performs best, with a failure detection rate (FDR) of 53.95% and a false alarm rate (FAR) of 6.0%. Moreover, the Gini importance of the SMART attributes is calculated, and two attributes, SMART 197 and SMART 187, are found to be closely related to HDD failures.
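The two reported metrics can be illustrated with a minimal sketch. It assumes the common definitions: FDR is the fraction of failed samples that are correctly flagged, and FAR is the fraction of healthy samples that are incorrectly flagged. The function name `fdr_far`, the label encoding (1 = failed, 0 = healthy), and the toy labels are illustrative assumptions, not taken from the paper, which may count per drive rather than per sample.

```python
from typing import Sequence, Tuple


def fdr_far(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (FDR, FAR) for binary labels, assuming 1 = failed and 0 = healthy."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # failed, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # failed, missed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # healthy, flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # healthy, not flagged

    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("need at least one failed and one healthy sample")

    fdr = tp / (tp + fn)  # failure detection rate
    far = fp / (fp + tn)  # false alarm rate
    return fdr, far


# Toy example (made-up labels): 4 failed and 6 healthy samples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
fdr, far = fdr_far(y_true, y_pred)
print(f"FDR = {fdr:.2%}, FAR = {far:.2%}")
```

In this toy case two of the four failed samples are detected and one of the six healthy samples raises a false alarm. Because healthy drives usually far outnumber failed ones, even a small FAR can translate into many false alarms in absolute terms.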
