Latent Sector Errors (LSEs) occur frequently in the field and pose a serious risk to data reliability. Disk scrubbing is a background process that periodically reads disks to detect LSEs promptly, thereby shortening the window of vulnerability to data loss. Recently, proactive error prediction based on machine learning has been proposed to improve storage system reliability by increasing the scrubbing rate for disks with higher error rates. Unfortunately, most existing work incurs non-trivial scrubbing costs and overlooks the relationship between complete disk failures and LSEs. In this paper, we aim to maintain or improve data reliability at reduced scrubbing cost. In particular, we design a novel adaptive approach that enforces a lower scrubbing rate for healthy disks and a higher scrubbing rate for disks that are prone to LSEs. Besides LSEs, which are specific to partial disk failures, we also adjust scrubbing rates according to complete disk failure rates, because disks typically develop LSEs before they finally fail. Moreover, we propose a voting-based method that exploits the periodic nature of scrubbing to ensure prediction accuracy. Experimental results on a real-world field dataset demonstrate the effectiveness of the approach. Specifically, we achieve the same level of reliability, measured by Mean-Time-To-Detection (MTTD), as the traditional fixed-rate scrubbing scheme at almost 49% lower scrubbing cost, or improve reliability by a factor of 2.4 at no extra scrubbing cost. Compared with state-of-the-art approaches, our method achieves the same level of reliability at nearly 32% lower scrubbing cost.
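As a rough illustration of the adaptive idea, the sketch below combines several recent per-disk predictions by majority vote and picks a shorter scrubbing interval for disks voted at risk and a longer one for disks voted healthy. The abstract does not specify the voting rule, the predictor, or any interval values, so the function names, the 14-day baseline, and the scaling factors are all hypothetical and chosen only for illustration.

```python
from collections import deque


def majority_vote(predictions):
    """Return True when more than half of the predictions flag the disk as at risk.

    Assumption: a simple strict majority; the paper's actual voting rule may differ.
    """
    votes = list(predictions)
    return sum(votes) * 2 > len(votes)


def scrub_interval_days(predictions, base_days=14.0, fast_factor=0.5, slow_factor=2.0):
    """Pick a scrubbing interval from recent at-risk predictions.

    All numeric defaults are hypothetical placeholders, not values from the paper.
    """
    if not predictions:
        # No prediction history yet: fall back to the fixed-rate baseline.
        return base_days
    if majority_vote(predictions):
        # Voted at risk: scrub more often (shorter interval).
        return base_days * fast_factor
    # Voted healthy: scrub less often (longer interval).
    return base_days * slow_factor


# Keep only the most recent predictions made within one scrubbing period.
history = deque(maxlen=5)
for flag in [False, True, True, False, True]:
    history.append(flag)

interval = scrub_interval_days(history)
```

Voting over several predictions, instead of acting on a single one, is one way to keep an isolated misprediction from changing a disk's scrubbing rate.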